MIXTURES OF REGRESSIONS

By Andriy Norets 

Princeton University 

This paper shows that large nonparametric classes of conditional 
multivariate densities can be approximated in the Kullback-Leibler 
distance by different specifications of finite mixtures of normal regres- 
sions in which normal means and variances and mixing probabilities 
can depend on variables in the conditioning set (covariates). These
models are a special case of models known as "mixtures of experts" 
in statistics and computer science literature. Flexible specifications 
include models in which only mixing probabilities, modeled by multi- 
nomial logit, depend on the covariates and, in the univariate case, 
models in which only means of the mixed normals depend flexibly on 
the covariates. Modeling the variance of the mixed normals by flex- 
ible functions of the covariates can weaken restrictions on the class 
of the approximable densities. Obtained results can be generalized 
to mixtures of general location scale densities. Rates of convergence 
and easy to interpret bounds are also obtained for different model 
specifications. These approximation results can be useful for proving 
consistency of Bayesian and maximum likelihood density estimators 
based on these models. The results also have interesting implications 
for applied researchers. 

1. Introduction. This paper explores approximation properties of finite 
smooth mixtures of normal regressions as flexible models for conditional 
densities. These models are a special case of mixtures of experts (ME)
introduced by Jacobs et al. (1991). ME have become increasingly popular in the
statistical literature since they are very flexible, easy to interpret and rea- 
sonably easy to estimate. See, for example, papers by Jordan and Jacobs 
(1994) and Jordan and Xu (1995) who employ the expectation maximization 
(EM) estimation algorithm or papers by Peng, Jacobs and Tanner (1996), 
Wood, Jiang and Tanner (2002), Geweke and Keane (2007) and 
Villani, Kohn and Giordani (2009) who use Markov chain Monte Carlo methods
for estimation of ME in the Bayesian framework. This paper contributes
to the literature that provides a theoretical explanation of the success of ME 
models in applications. In particular, I show that large classes of conditional 
densities can be approximated in the Kullback-Leibler (KL) distance by 
finite smooth mixtures of normal regressions. Approximation results are ob- 
tained in the KL distance for the following reason. If a data generating 
density is in the KL closure of a class of models then this density can be 
consistently estimated from data by these models under weak regularity con- 
ditions [see, e.g., Ghosh and Ramamoorthi (2003) for a textbook treatment 
of Schwartz's theorem on posterior consistency and Roeder and Wasserman
(1997) for posterior consistency results for finite mixtures of normals].

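Consider a joint probability distribution $F$ on a product space $Y \times X$,
$Y \subset R^{d}$ and $X \subset R^{d_x}$. Assume the conditional distribution $F(y|x)$ has a
density $f(y|x)$ with respect to the Lebesgue measure. The marginal density
of $x$ with respect to some generic measure is denoted by $f(x)$. A model
$\mathcal{M}$ for the conditional density $f(y|x)$ is described by $p(y|x,\mathcal{M})$. The KL
distance between $f(y|x)f(x)$ and $p(y|x,\mathcal{M})f(x)$ is defined by

$$d_{KL}(F,\mathcal{M}) = \int \log\frac{f(y|x)f(x)}{p(y|x,\mathcal{M})f(x)}\,F(dy,dx) = \int \log\frac{f(y|x)}{p(y|x,\mathcal{M})}\,F(dy,dx).$$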

This distance can also be interpreted as the expected KL distance between 
the conditional distributions. Either way, this is the distance useful for ob- 
taining estimation consistency results. Also, convergence in the KL distance 
implies convergence in the total variation distance. Below, I consider several 
different specifications of mixture of normal regressions models, $p(y|x,\mathcal{M})$,
and provide conditions on $F$ under which $d_{KL}(F,\mathcal{M})$ can be made arbitrarily
small. I also derive rates of convergence and easy to interpret bounds for
$d_{KL}(F,\mathcal{M})$.
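
The link between the KL distance and the total variation distance mentioned above can be checked numerically through Pinsker's inequality, which bounds the total variation distance by $\sqrt{d_{KL}/2}$. The sketch below is only a sanity check on two hypothetical discrete distributions (they are not taken from the paper).

```python
import math

def kl_distance(p, q):
    """KL distance sum_i p_i * log(p_i / q_i) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def total_variation(p, q):
    """Total variation distance sup_A |P(A) - Q(A)| = 0.5 * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical distributions on three points, chosen only for illustration.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

kl = kl_distance(p, q)
tv = total_variation(p, q)
pinsker_bound = math.sqrt(kl / 2.0)  # Pinsker's inequality: tv <= sqrt(kl / 2)
print(f"KL = {kl:.5f}, TV = {tv:.5f}, Pinsker bound = {pinsker_bound:.5f}")
```

Because the bound shrinks with the KL distance, any sequence of models whose KL distance to $F$ vanishes also converges in total variation.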

In general, a finite mixture of normal regressions model can be written as

$$p(y|x,\mathcal{M}) = \sum_{j=1}^{m} \alpha_j^m(x)\,\phi(y,\mu_j^m(x),\sigma_j^m(x)),$$

where mixing probabilities satisfy $\alpha_j^m(x) \in [0,1]$ and $\sum_j \alpha_j^m(x) = 1$, and
$\phi(y,\mu,\sigma)$ is a normal density with mean $\mu$ and standard deviation $\sigma$ evaluated
at $y$ (if $y$ is multidimensional then the variance-covariance matrix is
diagonal, $\sigma^2 I$). Most of the results obtained in the paper can be easily extended
to models in which general location scale densities $\sigma^{-d}K((y-\mu)/\sigma)$
are mixed instead of the normal densities $\phi(y,\mu,\sigma)$. Models in which the
mixing weights depend on $x$ are referred to in this paper as smooth mixtures.
In practice, the $\alpha_j^m(x)$'s are often modeled by a multinomial choice model,
for example, multinomial logit [Peng, Jacobs and Tanner (1996)] or probit
[Geweke and Keane (2007)], or they might not depend on $x$. The mean $\mu_j^m(x)$
can be constant, linear or flexible, for example, polynomial, in $x$. An exponentiated
polynomial or spline in $x$ can be used for modeling the standard
deviation $\sigma_j^m(x)$ [Villani, Kohn and Giordani (2009)].
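
To make the general specification concrete, here is a minimal sketch of a univariate smooth mixture of normal regressions with multinomial logit mixing probabilities, means that are functions of $x$, and an exponentiated-polynomial standard deviation. All parameter values, and the function names `logit_weights`, `mixture_density` and `integrate_density`, are hypothetical choices for illustration, not the specification of any of the cited papers.

```python
import math

def normal_pdf(y, mu, sigma):
    """Univariate normal density with mean mu and standard deviation sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def logit_weights(x, intercepts, slopes):
    """Multinomial logit mixing probabilities with linear indices a_j + b_j * x."""
    indices = [a + b * x for a, b in zip(intercepts, slopes)]
    top = max(indices)  # subtract the maximum for numerical stability
    expd = [math.exp(v - top) for v in indices]
    total = sum(expd)
    return [e / total for e in expd]

def mixture_density(y, x, intercepts, slopes, mean_fns, sd_fns):
    """p(y|x) = sum_j alpha_j(x) * phi(y, mu_j(x), sigma_j(x))."""
    weights = logit_weights(x, intercepts, slopes)
    return sum(w * normal_pdf(y, mu(x), sd(x))
               for w, mu, sd in zip(weights, mean_fns, sd_fns))

# Hypothetical two-component example.
intercepts = [0.0, 1.0]
slopes = [0.0, -2.0]
mean_fns = [lambda x: 1.0 + 0.5 * x, lambda x: -1.0]          # linear and constant means
sd_fns = [lambda x: math.exp(-0.5 + 0.3 * x), lambda x: 1.0]  # exponentiated polynomial, constant

def integrate_density(x, lo=-15.0, hi=15.0, n=6000):
    """Midpoint-rule integral of the mixture density over y (should be close to 1)."""
    step = (hi - lo) / n
    return step * sum(
        mixture_density(lo + (i + 0.5) * step, x, intercepts, slopes, mean_fns, sd_fns)
        for i in range(n)
    )

print(logit_weights(0.3, intercepts, slopes))
print(integrate_density(0.3))
```

The mixing weights sum to one for every $x$, so the mixture is a proper conditional density whatever functional forms are used for the means and standard deviations.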

To the best of my knowledge, previous literature on smooth mixtures of 
regressions (or experts) does not provide a theory on what specifications for 
$\alpha_j^m$, $\mu_j^m$ and $\sigma_j^m$ deliver a model that can approximate and consistently
estimate large nonparametric classes of densities $F$. There are theoretical results
on approximation of smooth functions and estimation of conditional expec- 
tations by ME [see Zeevi, Meir and Maiorov (1998) and Maiorov and Meir 
(1998)]. The only paper on approximation of conditional densities by ME 
seems to be Jiang and Tanner (1999) who develop approximation and esti- 
mation results for target densities from a single parameter exponential fam- 
ily, in which the parameter is a smooth function of covariates. A detailed 
comparison with results in Jiang and Tanner (1999) is presented in Section 
6. In this paper, I do not restrict the functional form of $f(y|x)$ and use weak
regularity conditions to describe a class of $F$ that can be approximated.
Conditions on approximable classes of $f(y|x)$ and $f(x)$ that are common
for different model specifications include bounded support for $f(x)$, continuity
of $f(y|x)$ in $(y,x)$, finite expectation of a change of $\log f(y|x)$ in a
neighborhood of $y$ and existence of the second moments of $y$. The latter
restriction can be weakened by adding densities with fat tails to the mixtures
in addition to normal densities. 

In Section 4, I show that considerable flexibility is already attained when 
$\alpha_j^m$'s are modeled by multinomial logit with linear indices in $x$, and $(\mu_j^m, \sigma_j^m)$
are independent of x. Results in Sections 3 and 4 suggest that using polyno- 
mials in the logit specification reduces the number of mixture components m 
required to achieve a specified approximation precision. As shown in Section 
5, models for univariate response y in which the mixing probabilities and 
the variances of the mixed normals are independent of x, and the means 
are flexible, for example, polynomial in x, can approximate large classes 
of $f(y|x)$. Differences in quantiles of $f(y|x)$ from these classes have to be
bounded above and below uniformly in $x$. These restrictions on $f(y|x)$ can
be weakened if the variances of the mixed normals are modeled by flexible 
functions of x. Section 7 summarizes the findings. 

2. Infeasible model. In this section, I explicitly construct a smooth mix- 
ture of normals model that converges to a given F in the KL distance as 
m increases. This model is not feasible in the sense that it is not based on 
components employed in practice, for example, logit/probit mixing probabilities.
However, the results for feasible models presented in the following
sections follow from this one or are similar. 

Let $A_j^m$, $j = 0,1,\ldots,m$, be a partition of $Y$ consisting of adjacent half-open
half-closed hypercubes $A_1^m,\ldots,A_m^m$ with side length $h_m$ and the rest of
the space $A_0^m$. As $m$ increases the fine part of the partition becomes finer,
$h_m \to 0$. Also, it covers a larger and larger part of $Y$: for any $y \in Y$ there exists
$M_0$ such that

(2.1) $\forall m \ge M_0, \quad C_{\delta_m}(y) \cap A_0^m = \emptyset,$

where $C_{\delta_m}(y)$ is a hypercube with center $y$ and side length $\delta_m \to 0$. It is
always possible to construct such a partition. For example, if $Y = [0,\infty)$
let $A_0^m = [\log m,\infty)$, $A_j^m = [(j-1)\log m/m,\ j\log m/m)$ for $j \ne 0$, and $h_m =
\log m/m$.

A candidate model $\mathcal{M}_0$ for approximating $f(y|x)$ is

(2.2) $p(y|x,\mathcal{M}_0) = \sum_{j=1}^{m} F(A_j^m|x)\,\phi(y,\mu_j^m,\sigma_m) + F(A_0^m|x)\,\phi(y,0,\sigma_0),$

where $\sigma_0$ is fixed, $\sigma_m$ converges to zero as $m$ increases and $\mu_j^m$ is the center
of $A_j^m$. One can always construct a model $\mathcal{M}_0$ and a partition $A_j^m$ so that

(2.3) $\delta_m \to 0, \quad \sigma_m/\delta_m \to 0, \quad \delta_m^{d-1}h_m/\sigma_m^{d} \to 0,$

for example, in the example for $Y = [0,\infty)$ from the previous paragraph let

For a partition satisfying (2.1) and (2.3), let us introduce the following 
restrictions on F. 



Assumption 2.1. 1. $f(y|x)$ is continuous in $y$ a.s. $F$.

2. The second moments of $y$ are finite.

3. For any $(y,x)$ there exists a hypercube $C(r,y,x)$ with side length $r > 0$
and $y \in C(r,y,x)$ such that (i)

(2.4) $\int \log\frac{f(y|x)}{\inf_{z\in C(r,y,x)} f(z|x)}\,F(dy,dx) < \infty$

and (ii) there exists $M_3$ such that for any $m \ge M_3$, if $y \in A_0^m$ then $C(r,y,x) \cap
A_0^m$ contains a hypercube $C_0(r,y,x)$ with side length $r/2$ and a vertex
at $y$ and if $y \in Y \setminus A_0^m$, then $C(r,y,x) \cap (Y \setminus A_0^m)$ contains a hypercube
$C_1(r,y,x)$ with side length $r/2$ and a vertex at $y$.

Parameter $\sigma_0$ can always be chosen so that

(2.5) $1 > 2^{-(d+1)} > \phi(y,0,\sigma_0)\,\lambda(C_0(r,y,x)),$

where $\lambda$ is the Lebesgue measure.

Proposition 2.1. If the model $p(y|x,\mathcal{M}_0)$ and the partition $A_j^m$ are
constructed so that (2.1), (2.2), (2.3) and (2.5) hold, and $F$ satisfies Assumption
2.1, then $d_{KL}(F,\mathcal{M}_0) \to 0$ as $m \to \infty$.

The proposition is rigorously proved in the Appendix. Here, I briefly de- 
scribe the intuition behind the argument and the role of the assumptions. 
Convergence in the KL distance is proved by the dominated convergence 
theorem (DCT). First, I establish point-wise convergence of the integrand,
$\log f(y|x)/p(y|x,\mathcal{M}_0)$, to zero, and then I derive an integrable upper bound
on the integrand for the DCT applicability. Nonnegativity of the KL dis- 
tance is fruitfully exploited in the proof as it allows working only with upper 
bounds and ignoring the lower ones in convergence arguments. 

The first term on the right-hand side of (2.2) (the sum from 1 to $m$)
approximates the integral

(2.6) $\int \phi(y,\mu,\sigma_m)\,f(\mu|x)\,d\mu = \int f(y-\sigma_m z|x)\,\phi(z,0,1)\,dz,$

when $h_m$ is much smaller than $\sigma_m$, and the fine part of the partition is
large. The integral on the right-hand side of (2.6) is obtained by the change
of variables $\mu = y - \sigma_m z$. For a small $\delta_m$ and $z$ satisfying $\|\sigma_m z\| < \delta_m$,
$f(y-\sigma_m z|x)$ is close to $f(y|x)$ as $f(y|x)$ is assumed to be continuous in $y$.
Therefore, when $\sigma_m$ is much smaller than $\delta_m$ the right-hand side of (2.6)
should be close to $f(y|x)$. Thus, this intuitive argument explains the role of
conditions (2.3) and continuity of $f(y|x)$.
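
The intuition above can be checked numerically. The sketch below builds the candidate model (2.2) for a univariate exponential target on $Y = [0,\infty)$ with the partition from the earlier example ($h_m = \log m/m$, $A_0^m = [\log m,\infty)$). The choices $\sigma_m = h_m^{1/2}$, $\sigma_0 = 1$ and the rate parameter 1.5 are assumptions made only for this illustration (the source's own example choice is not reproduced here); for $d = 1$ they keep $h_m/\sigma_m \to 0$.

```python
import math

GAMMA = 1.5  # hypothetical rate of the exponential target, x held fixed

def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def target_pdf(y):
    return GAMMA * math.exp(-GAMMA * y) if y >= 0.0 else 0.0

def target_cdf(y):
    return 1.0 - math.exp(-GAMMA * y) if y >= 0.0 else 0.0

def model_m0(y, m, sigma0=1.0):
    """Density (2.2): cell probabilities F(A_j) times normals centered at the cell centers."""
    h = math.log(m) / m          # side length of the fine cells
    sigma_m = math.sqrt(h)       # assumed bandwidth, so that h / sigma_m -> 0
    total = 0.0
    for j in range(1, m + 1):
        lo, hi = (j - 1) * h, j * h
        cell_prob = target_cdf(hi) - target_cdf(lo)          # F(A_j)
        total += cell_prob * normal_pdf(y, 0.5 * (lo + hi), sigma_m)
    tail_prob = 1.0 - target_cdf(m * h)                      # F(A_0), A_0 = [log m, inf)
    return total + tail_prob * normal_pdf(y, 0.0, sigma0)

def approximation_error(y, m):
    return abs(model_m0(y, m) - target_pdf(y))

for m in (50, 500, 2000):
    print(m, approximation_error(1.0, m))
```

At an interior point such as $y = 1$ the error shrinks as $m$ grows, roughly in proportion to $\sigma_m^2$ for this smooth target, which is the smoothing effect described by (2.6).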

The second term on the right-hand side of (2.2) converges to zero. This
term is not needed for point-wise convergence. It can be omitted when the
support of $f(y|x)$ is bounded uniformly in $x$ as in this case we can set
$A_0^m = \emptyset$ and use the same variance in all mixture components (there is
no need to define $\sigma_0$). This term together with part 2 of Assumption 2.1
prevents tails of $p(y|x,\mathcal{M}_0)$ from becoming too thin relative to $f(y|x)$ in the
unbounded support case (in the absence of this term the tails would be too
thin as $\sigma_m \to 0$).

Parts 2 and 3 of Assumption 2.1 together guarantee existence of an
integrable upper bound for the DCT applicability. An upper bound on
$\log f(y|x)/p(y|x,\mathcal{M}_0)$ involves a lower bound on $p(y|x,\mathcal{M}_0)$. Both terms on
the right-hand side in the definition of $p(y|x,\mathcal{M}_0)$ in (2.2) can be bounded
below by an expression proportional to $\inf_{z\in C(r,y,x)} f(z|x)$. That is how condition
(2.4) is deduced. The lower bound for the second term in (2.2) also
includes $\phi(y,0,\sigma_0)$ and that is why finiteness of the second moments of $y$ is
assumed.

One interpretation of condition (2.4) [part 3(i) of Assumption 2.1] is that 
local relative changes in $f(y|x)$ due to changes in $y$ should not be infinitely
large on average. It seems difficult to think of an unconditional density, which
is well behaved and positive everywhere, that would violate (2.4). This part
of the assumption though can be violated by reasonable conditional densities 
as Example 2.1 below illustrates. 

When $f(y|x)$ is positive everywhere, part 3(ii) of Assumption 2.1 is not
needed. It always holds if $C(r,y,x)$ is a hypercube with center at $y$. Part
3(ii) becomes important when $f(y|x)$ can be equal to zero. In particular, the
sets $C_0(r,y,x)$ and $C_1(r,y,x)$ in part 3(ii) of Assumption 2.1 are introduced
to specify that C(r, y, x) needs to be defined differently near the boundary of 
the support and in the tails if one wants to use condition (2.4) in its present 
form. This is illustrated in Figure 1. 

The support of $f(\cdot|x)$ should include $C(r,y,x)$ a.s. $F$; otherwise, part
3(i) of Assumption 2.1 is not satisfied. Therefore, for $f(y|x)$ in Figure 1, it
has to be the case that $C(r,y,x) = [y,y+r]$ at the boundary of the support
(the intersection of the axes). Setting $C(r,y,x) = [y,y+r]$ near the
boundary of the support makes the ratio $f(y|x)/\inf_{z\in C(r,y,x)} f(z|x)$ the smallest
possible (equal to one) and thus helps with condition (2.4). Parts of
$Y$ near the boundary of the support are covered by the fine part of the
partition $A_1^m,\ldots,A_m^m$ for all sufficiently large $m$, and part 3(ii) of Assumption
2.1 holds for $C_1(r,y,x) = [y,y+r/2]$. Using $C(r,y,x) = [y,y+r]$ for
all $y$ would not work, since for any $m$ one can find $y \in A_m^m$ such that
$C(r,y,x) \cap (Y \setminus A_0^m)$ is arbitrarily small, so that part 3(ii) of Assumption 2.1 fails.
Thus, for $y$ that are arbitrarily far from the boundary of the support, one
has to use $C(r,y,x) = [y-r/2,y+r/2]$ eventually. Then, part 3(ii) of the
assumption clearly holds for $C_1(r,y,x) = [y-r/2,y]$, $C_0(r,y,x) = [y,y+r/2]$
and any $m$.

Results in this section and similar results in the following sections can 
be generalized in several different ways. First, the derivation of the inte- 
grable upper bound in the proof of Proposition 2.1 suggests that the re- 
quirement of finite second moments of y can be weakened by adding a 
density with thicker than normal tails to the mixture of normals; for example,
substitute $\phi(y,0,\sigma_0)$ in (2.2) with a Student t-density. Second, more
general shapes of the support of $F$ can be accommodated if instead of hypercubes
$C(r,y,x)$, $C_0(r,y,x)$, and $C_1(r,y,x)$ in Assumption 2.1 different
sets with positive Lebesgue measure are used. For example, if the support
of $f(\cdot|x)$ is a triangle in $R^2$ then small triangles can be used instead of the
squares $C(r,y,x)$, $C_0(r,y,x)$ and $C_1(r,y,x)$. Third, general location scale
densities $\sigma^{-d}K((y-\mu)/\sigma)$ can be used in mixtures instead of normal densities.
As long as analogs of Lemmas A.1, A.2 and A.3 (see the Appendix)
are available for a particular type of densities, results in this and the following
sections will hold for mixtures of these densities. Lemmas A.1 and
A.3 hold for $\sigma^{-d}K((y-\mu)/\sigma)$ if $K(z)$ is bounded and nonincreasing in $|z|$
(proofs of the lemmas use only these facts about the normal distributions).
The derivation of bounds in Lemma A.2 exploits normality; however, the
qualitative results of the lemma hold as long as $\int_{R^d} K(z)\,dz = 1$ and $K(z)$ is
positive in a neighborhood of zero. Thus, all the results in this paper that
establish $d_{KL}(F,\mathcal{M}) \to 0$ do not depend on the normality assumption; however,
bounds and convergence rates for $d_{KL}(F,\mathcal{M})$ derived below are specific
to mixtures of normal densities, and they might be different for mixtures of
other densities. All these generalizations seem to be straightforward and I
do not pursue them in this paper to keep the arguments short and simple.

Fig. 1. Construction of $C(r,y,x)$.

Examples below demonstrate that Assumption 2.1 is satisfied for a large 
class of densities. They also describe some situations in which the assumption 
fails. 

Example 2.1. Exponential distribution, $f(y|x) = \gamma(x)\exp\{-\gamma(x)y\}$,
$\gamma(x) > 0$. The density is continuous in $y$ (part 1 of Assumption 2.1). Let
$\int \gamma^{-2}\,dF < \infty$ so that the second moment of $y$ is finite (part 2 of Assumption
2.1). Define the partition $A_j^m$ and $C(r,y,x)$, $C_0(r,y,x)$ and $C_1(r,y,x)$
as shown in Figure 1, for example, for some $r > 0$ let $C(r,y,x) = [y,y+r]$
for $y \in [0,r]$ and $C(r,y,x) = [y-r/2,y+r/2]$ for $y \in (r,\infty)$. Thus, from
the discussion of Figure 1 above it follows that part 3(ii) of Assumption
2.1 is satisfied. Because $\log f(y|x)/\inf_{z\in C(r,y,x)} f(z|x) \le r\gamma(x)$, part 3(i) of
Assumption 2.1 holds as long as $\gamma(x)$ is integrable with respect to $f(x)$. If
$\gamma(x)$ is not integrable, then part 3(i) of the assumption fails.
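
The bound on the log ratio in Example 2.1 is easy to verify numerically for a fixed rate. The sketch below uses hypothetical values $r = 0.5$ and $\gamma = 2$, and approximates the infimum over the interval $C(r,y,x)$ on a grid that includes both endpoints.

```python
import math

def exp_density(y, gamma):
    return gamma * math.exp(-gamma * y) if y >= 0.0 else 0.0

def cube(r, y):
    """C(r, y, x) as in Example 2.1: [y, y + r] near the boundary, centered at y otherwise."""
    return (y, y + r) if y <= r else (y - r / 2.0, y + r / 2.0)

def log_ratio(r, y, gamma, grid_points=200):
    """log f(y) - log inf_{z in C(r, y, x)} f(z), with the infimum taken on a grid."""
    lo, hi = cube(r, y)
    inf_f = min(exp_density(lo + (hi - lo) * k / grid_points, gamma)
                for k in range(grid_points + 1))
    return math.log(exp_density(y, gamma) / inf_f)

r, gamma = 0.5, 2.0  # hypothetical values for the check
worst = max(log_ratio(r, 0.01 * k, gamma) for k in range(0, 1001))
print(worst, r * gamma)
```

The largest log ratio is attained near the boundary, where the one-sided interval $[y,y+r]$ is used, and it does not exceed $r\gamma$; averaging over $x$ then requires $\gamma(x)$ to be integrable, as stated in the example.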

Example 2.2. A Student t-distribution, in which scale and location
parameters are functions of $x$, $f(y|x) \propto [\nu + ((y-b(x))/c(x))^2]^{-(\nu+1)/2}$, $\nu > 2$
and $b(x)^2$, $c(x)^{-2}$ and $c(x)^2$ are integrable w.r.t. $f(x)$. The second moment
of y is finite since 

$$\int y^2\,F(dy,dx) = \int \Bigl[b(x)^2 + \frac{\nu}{\nu-2}\,c(x)^2\Bigr]F(dx) < \infty.$$

As I discuss above, for densities positive everywhere part 3(ii) of Assumption
2.1 always holds with $C(r,y,x) = [y-r/2,y+r/2]$. Part 3(i) of Assumption
2.1 is also satisfied because



$$\int \log\frac{f(y|x)}{\inf_{z\in C(r,y,x)} f(z|x)}\,F(dy,dx)$$

$$\le 2\int_X\int_{b(x)}^{\infty} \log\frac{[\nu + ((y-b(x))/c(x))^2]^{-(\nu+1)/2}}{[\nu + ((y+r-b(x))/c(x))^2]^{-(\nu+1)/2}}\,f(y|x)\,dy\,F(dx)$$

$$\le (\nu+1)\,2\int_X\int_{b(x)}^{\infty} \frac{\nu + ((y+r-b(x))/c(x))^2}{2\nu}\,f(y|x)\,dy\,F(dx) < \infty,$$

where the last inequality follows by the integrability of $(y-b(x))/c(x)$, its
square and $c(x)^{-2}$.

Example 2.3. Suppose that conditional density $f(y|x)$ is continuous in
$y$ and bounded above and away from zero, $\infty > \bar f \ge f(y|x) \ge \underline f > 0$ for any
$y \in Y = [a,b]$ and $x \in X$. Then we can set $A_0^m = \emptyset$. For $r \in (0,(b-a)/4)$, let
$C(r,y,x) = [y,y+r]$ and $C_1(r,y,x) = [y,y+r/2]$ for $y \in [a,(a+b)/2]$ and
$C(r,y,x) = [y-r,y]$ and $C_1(r,y,x) = [y-r/2,y]$ for $y \in ((a+b)/2,b]$. Clearly,
part 3(ii) of Assumption 2.1 is satisfied. Because $f(y|x)/\inf_{z\in C(r,y,x)} f(z|x) \le \bar f/\underline f$,
part 3(i) of Assumption 2.1 also holds. The second moment of $y$ is
finite and thus all parts of Assumption 2.1 hold.

The boundedness away from zero condition can be replaced by a monotonicity
condition at the boundary of the support. For example, let $f(y|x)$
be nondecreasing on $[a,a+2r]$, nonincreasing on $[b-2r,b)$ and bounded
below by $\underline f > 0$ on $[a+r,b-r]$. In this case $f(y|x)/\inf_{z\in C(r,y,x)} f(z|x) \le
\max\{1,\bar f/\underline f\}$ for any $y \in [a,b]$. Thus, part 3(i) of Assumption 2.1 holds. The
other parts of the assumption are not affected by this change.

Example 2.4. Consider a uniform distribution $f(y|x) = x^{-1}1_{[0,x]}(y)$
and $f(x) > 0$ for any $x \in [1,\infty)$. A natural choice of the partition would
be $A_0^m = [mh_m,\infty)$ and $A_j^m = [(j-1)h_m,\ jh_m)$ for $j \in \{1,\ldots,m\}$. When
$y = x$, the only reasonable choice of $C(r,y,x)$ is $C(r,y,x) = [y-r,y]$. For an
arbitrary $m$ and $y = x = mh_m + r/4$, $C(r,y,x)$ violates part 3(ii) of Assumption
2.1 since the only possible $C_0(r,y,x) = [y-r/2,y]$ is not included in
$A_0^m$. For $f(x)$ with bounded support, this example would satisfy Assumption
2.1 since in this case we could set $A_0^m = \emptyset$.

This example illustrates that Assumption 2.1 rules out some cases in
which the support of $f(\cdot|x)$ is increasing in $x$ without a bound. In Section 5,
I consider model specifications in which means and variances of the mixed
normals can be flexible functions of $x$. Those specifications seem to be more
promising for modeling densities $f(\cdot|x)$ with support increasing in $x$ without
a bound (see Example 5.2).

2.1. Approximation error bounds. The proof techniques of this section
can also be used to derive explicit bounds on the approximation error. The 
bounds for positive everywhere and especially differentiable f(y\x) are par- 
ticularly informative. It is also easy to deduce an approximation rate from 
them. Thus, I present below the bounds and approximation rate for these 
special albeit important cases. Convergence rates and bounds for other spe- 
cial classes can be obtained in a similar way, for example, for densities 
bounded away from zero. However, rates and bounds for the general case 
seem to be difficult to calculate. 



Corollary 2.1. Part (i). Suppose the model $p(y|x,\mathcal{M}_0)$ and the partition
$A_j^m$ are constructed so that (2.1), (2.2), (2.3) and (2.5) hold. Suppose
$f(y|x)$ is positive and continuous in $y$ on $Y = R^d$ for all $x$, second moments
of $y$ are finite and (2.4) holds with $C(r,y,x) = C_r(y)$ taken to be a hypercube
with center at $y$ and radius $r$. Then, for all sufficiently large $m$,

(2.7) d Kh (F,M )< /kg ,,, f ^ X \ f ^ F (dy,dx) 



(2.8) 
(2.9) 

(2.10) 



™f ze c Srn (y)f(z\xy 



3d 3 / 2 5 d ~ 1 h 
+ 2 1 Z 7" +2exp 



+ 



+ 



{^) d/2 < 
log 



(<5m/<7rr 



f(y\x) 



y'y 



inf *ec r (y) f(z\x] 
r/2) d 



F(dy, dx) 



(2^)^/2 



F(dy,dx), 



where $B_{\delta_m}(A_0^m) = \{(y,x) : C_{\delta_m}(y) \cap A_0^m \ne \emptyset\}$ and bounds in (2.7)-(2.10) converge
to zero as $m \to \infty$.

Part (ii). If $f(y|x)$ is continuously differentiable in $y$ for all $x$ and instead
of (2.4) the following condition holds:

(2.11) $\int \sup_{z\in C_r(y)} \left\|\frac{\partial \log f(z|x)}{\partial z}\right\| F(dy,dx) < \infty,$

then for all sufficiently large $m$,
(2.12) d KL (F,M )<5 m - — 



sup 

z ^ C f>m (y) 



dlogf(z\x) 



dz 



F(dy, dx) 



(2.13) 
(2.14) 



oj3/2j;d-l^ 

+ 2 Z 7" +2exp 



+ 



(2^)<W, 



. d l/2 



sup 



(8m/0- m f 



dlogf(z\x) 



dz 



F(dy, dx) 
(2.15)



+ 



Mo) 



y'y 



(r/2Y 



(27rcj2) rf / 2 



F(dy,dx), 



and bounds in (2.12)-(2.15) converge to zero as $m \to \infty$.

Part (iii). If, in addition to assumptions from part (ii), for some $q > 2$
and some $i_1 \in \{1,\ldots,d\}$

(2.16) $\int |y_i|^{q}\,F(dy) < \infty, \quad i \in \{1,\ldots,d\},$

and

(2.17) $\int |y_{i_1}|^{q-2} \sup_{z\in C_r(y)} \left\|\frac{\partial \log f(z|x)}{\partial z}\right\| F(dy,dx) < \infty,$

then the approximation error bound can be written as

(2.18) $d_{KL}(F,\mathcal{M}_0) \le c \cdot \left(\frac{1}{m}\right)^{1/(d\cdot[2+1/(q-2)+\varepsilon])},$

where $\varepsilon > 0$ can be arbitrarily close to zero and $c$ does not depend on $m$.

The corollary is proved in the Appendix. The bounds in part (i) of the 
corollary follow from the proof of Proposition 2.1. The bounds in part (ii) are 
derived from the bounds in part (i), and they are especially easy to interpret. 
The larger the "average" derivative of $\log f(y|x)$ is the smaller $\delta_m$ has to be
to achieve a prespecified level for the right-hand side of (2.12). Constant $h_m$
has to be much smaller than $\sigma_m$, and $\sigma_m$ has to be much smaller than $\delta_m$
[condition (2.3)] so that (2.13) becomes sufficiently small. Size of (2.14) and
(2.15) depends on how fast and by how much tails of $f(y|x)f(x)$ dominate
$\partial \log f(y|x)/\partial y$, $y^2$, and a constant.

The approximation rate in part (iii) is derived from the bounds in part (ii).
Expressions in (2.12) and (2.13) can be immediately converted into expressions
in terms of $m$. To convert (2.14) and (2.15) into expressions in terms of $m$ one
seems to need slightly more than integrability of $\sup_{z\in C_r(y)} \|\partial \log f(z|x)/\partial z\|$
[condition (2.17)] and slightly more than finiteness of the second moments of
$y$ [condition (2.16)]. Under these conditions, (2.14) and (2.15) are bounded
by $(h_m m^{1/d})^{-(q-2)}$ times a constant (see the corollary proof). An upper
bound on $(h_m m^{1/d})^{-(q-2)}$, (2.12) and (2.13) gives the rate in (2.18). This
upper bound has to be strictly larger than (2.18) with $\varepsilon = 0$ as I show in the
corollary proof. For distributions with exponentially declining tails, (2.14)
and (2.15) can be decreasing exponentially in $h_m m^{1/d}$. In this case, one can
set $q = \infty$ in (2.18) (see Example 5.3 below).
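
To get a feel for how the rate depends on the dimension of $y$ and on the number of finite moments, the sketch below evaluates the exponent under the assumption that (2.18) reads $d_{KL}(F,\mathcal{M}_0) \le c\,m^{-1/(d[2+1/(q-2)+\varepsilon])}$. The constant $c = 1$ and the helper names `rate_exponent` and `components_needed` are placeholders introduced here, not quantities from the paper.

```python
import math

def rate_exponent(d, q, eps=0.0):
    """Exponent a in the bound c * m**(-a), assuming a = 1 / (d * (2 + 1/(q - 2) + eps))."""
    if q <= 2:
        raise ValueError("q must exceed 2")
    return 1.0 / (d * (2.0 + 1.0 / (q - 2.0) + eps))

def components_needed(tol, d, q, eps=0.0, c=1.0):
    """Smallest m with c * m**(-a) <= tol; c = 1 is an arbitrary placeholder constant."""
    a = rate_exponent(d, q, eps)
    return math.ceil((c / tol) ** (1.0 / a))

for d in (1, 2, 3):
    for q in (3.0, 6.0, math.inf):
        print(d, q, round(rate_exponent(d, q), 4), components_needed(0.1, d, q))
```

Under this reading the exponent is at best $1/(2d)$ (exponentially declining tails, $q = \infty$), so the number of components needed for a given precision grows exponentially in $d$.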

The dimension of y enters the approximation bounds exponentially. The 
dimension of x does not affect the bound and the approximation rate for 
the "infeasible" model because this model is constructed with the use of 
$F(A_j^m|x)$'s, which are unknown functions of $x$. The following sections shed
some light on the role of the dimension of x in approximating f(y\x) by 
feasible models. 

3. Flexible multinomial choice models for mixing probabilities. This sec- 
tion gives conditions under which approximation results for "infeasible" 
model $\mathcal{M}_0$ also hold for a model with logit mixing probabilities that include
polynomial terms in $x$. It also shows how to extend these results to
multinomial probit and other models for mixing probabilities. 

Assumption 3.1. $X$ is compact and for partitions $A_j^m$, $j = 0,1,\ldots,m$
satisfying (2.1), $F(A_j^m|x)$ is a continuous function of $x$ on $X$ and $F(A_j^m|x) > 0$
[the support of $f(\cdot|x)$ does not depend on $x$].

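Under this assumption (by the Stone-Weierstrass theorem) for any sequence
$\varepsilon_m \to 0$, $\varepsilon_m > 0$, there exist finite order polynomials in $x$, $P_j^m(x)$,
such that

(3.1) $|\log F(A_j^m|x) - P_j^m(x)| < \varepsilon_m \quad \forall x \in X,\ j = 0,1,\ldots,m.$

Let $p(y|x,\mathcal{M}_1)$ denote a model with $\sigma_j^m$ and $\mu_j^m$ independent of $x$ and logit
mixing probabilities,

$$\alpha_j^m(x,\mathcal{M}_1) = \frac{\exp\{P_j^m(x)\}}{\sum_{k=0}^{m}\exp\{P_k^m(x)\}} = \frac{F(A_j^m|x)\exp\{P_j^m(x) - \log F(A_j^m|x)\}}{\sum_{k=0}^{m} F(A_k^m|x)\exp\{P_k^m(x) - \log F(A_k^m|x)\}}.$$

Condition (3.1) implies $\alpha_j^m(x,\mathcal{M}_1) \in (F(A_j^m|x)\exp\{-2\varepsilon_m\},\ F(A_j^m|x)\exp\{2\varepsilon_m\})$,
because the numerator is within a factor $\exp\{\pm\varepsilon_m\}$ of $F(A_j^m|x)$ and the
denominator is within a factor $\exp\{\pm\varepsilon_m\}$ of $\sum_k F(A_k^m|x) = 1$.
The following corollary immediately follows.

Corollary 3.1. If Assumption 3.1 and the conditions of Proposition
2.1 hold then $d_{KL}(F,\mathcal{M}_1)$ is bounded above and below by $d_{KL}(F,\mathcal{M}_0) \pm 2\varepsilon_m$
and thus converges to zero.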
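
The multiplicative bound behind Corollary 3.1 can be illustrated numerically. In the sketch below the target is a hypothetical $y|x \sim N(x,1)$ with $x \in [0,1]$, the fine part of the partition covers $[-3,4)$, and the polynomials $P_j^m(x)$ are replaced by a stand-in: $\log F(A_j^m|x)$ plus a perturbation smaller than $\varepsilon$ in absolute value. These are assumptions made purely for the check; no actual polynomial fit is performed.

```python
import math

def normal_cdf(v):
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def cell_probs(x, edges):
    """F(A_j | x), j = 0..m, for y | x ~ N(x, 1); A_0 is everything outside [edges[0], edges[-1])."""
    inner = [normal_cdf(hi - x) - normal_cdf(lo - x)
             for lo, hi in zip(edges[:-1], edges[1:])]
    return [1.0 - sum(inner)] + inner

def perturbed_log_probs(x, probs, eps):
    """Stand-in for P_j(x): log F(A_j | x) plus an error of absolute size at most 0.9 * eps."""
    return [math.log(p) + 0.9 * eps * math.sin(7.0 * x + j) for j, p in enumerate(probs)]

def softmax(values):
    top = max(values)  # subtract the maximum for numerical stability
    expd = [math.exp(v - top) for v in values]
    total = sum(expd)
    return [e / total for e in expd]

def max_log_ratio(eps, edges, grid):
    """Largest |log(alpha_j(x) / F(A_j | x))| over cells j and grid points x."""
    worst = 0.0
    for x in grid:
        probs = cell_probs(x, edges)
        alphas = softmax(perturbed_log_probs(x, probs, eps))
        worst = max(worst, max(abs(math.log(a / p)) for a, p in zip(alphas, probs)))
    return worst

edges = [-3.0 + 0.5 * k for k in range(15)]   # fine cells of width 0.5 on [-3, 4)
grid = [i / 20.0 for i in range(21)]          # x values in [0, 1]
eps = 0.05
print(max_log_ratio(eps, edges, grid), 2.0 * eps)
```

The logit probabilities stay within a factor $\exp\{\pm 2\varepsilon\}$ of the true cell probabilities, which is what moves the KL distance by at most $2\varepsilon$.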

It seems possible to extend this corollary to other models for mixing 
probabilities, in particular, to a class of multinomial choice models in which 
mixing probabilities have the following representation: 

$$\alpha_j^m(x) = \Pr[(\varepsilon_0,\ldots,\varepsilon_m) : v_j(x) + \varepsilon_j > v_k(x) + \varepsilon_k,\ k \in \{0,\ldots,m\}],$$

where $v_j(x)$ are flexible functions of $x$ and $\varepsilon_k$'s are i.i.d. Multinomial logit
and probit models fall into this category with polynomial $v_j(x)$ and extreme
value and normal distributions for $\varepsilon_k$'s. The proof of Proposition 1 in
Hotz and Miller (1993) implies that if $\varepsilon_k$ are i.i.d. and have a density with
respect to the Lebesgue measure, which is positive on $R$, then

$$(v_0(x),\ldots,v_{m-1}(x)) = Q(\alpha_0^m(x),\ldots,\alpha_{m-1}^m(x)),$$

where $v_m(x)$ is normalized to 0 and $Q$ and $Q^{-1}$ are differentiable mappings
defined on the interior of the $m$-dimensional simplex and on $R^m$, respectively.
Flexible functional forms for $(v_0(x),\ldots,v_{m-1}(x))$ can be used to approximate
$Q(F(A_0^m|x),\ldots,F(A_{m-1}^m|x))$. Then $(\alpha_0^m(x),\ldots,\alpha_{m-1}^m(x)) =
Q^{-1}(v_0(x),\ldots,v_{m-1}(x))$ will approximate $(F(A_0^m|x),\ldots,F(A_{m-1}^m|x))$. To get
an analog of Corollary 3.1 one only needs to show that $Q^{-1}$ transfers small
additive approximation errors in $v_j(x)$ into multiplicative approximation errors
for $\alpha_j^m(x)$ that are close to one. Since the mapping $Q^{-1}$ is continuous
this is the case as long as $F(A_j^m|x)$ are positive. Thus, it seems one does not
need more than Assumption 3.1 to extend Corollary 3.1 to other models for 
mixing probabilities. 

Of course, Corollary 3.1 can be formulated for any other method for ap- 
proximating continuous functions in the sup norm on compacts, for example, 
for splines instead of the polynomials in the logit mixing probabilities. 

The corollary implies that for F satisfying conditions of Corollary 2.1, 
bounds on the approximation error for model $\mathcal{M}_1$ are given by the bounds
in the corollary for $\mathcal{M}_0$ plus $\varepsilon_m$. Results from the function approximation
theory [see, e.g., Section 3.3 in Rust (1996) for a survey] suggest that to
achieve a worst case approximation bound $\varepsilon_m$, computable approximations
to Lipschitz continuous functions must involve a number of parameters
proportional to $\varepsilon_m^{-d_x}$ ($\varepsilon_m^{-d_x/n}$ if the function has bounded derivatives up to
order $n+1$). Thus, the number of parameters in the polynomials (or splines)
$P_j^m(x)$ depends at best exponentially on the dimension of $x$.

It might be very difficult to estimate a model with high order polynomials 
in the logit mixing probabilities. The following section shows that it is not 
necessary to use high order polynomials in logit specification to attain flex- 
ibility. However, as I discuss at the end of the following section, polynomial 
terms might reduce the number of mixture components required to achieve 
a specified approximation precision. 

4. Linear indices in logit. In this section I explore an alternative approximation
to $F(A_j^m|x)$ based on logit mixing probabilities that use only
linear indices in x. The following assumption is a slightly stricter analog of 
Assumption 2.1. 

Assumption 4.1. 1. $X = [0,1]^{d_x}$ (the arguments would go through for
a bounded $X$).

2. $f(y|x)$ is continuous in $(y,x)$ a.s. $F$.

3. The second moments of $y$ are finite.

4. For any $(y,x)$ there exists a hypercube $C(r,y,x)$ with side length $r > 0$
and $y \in C(r,y,x)$ such that (i)

(4.1) $\int \log\frac{f(y|x)}{\inf_{z\in C(r,y,x),\ \|t-x\|\le r} f(z|t)}\,F(dy,dx) < \infty$

and (ii) there exists $M$ such that for any $m \ge M$, if $y \in A_0^m$ then $C(r,y,x) \cap A_0^m$
contains a hypercube $C_0(r,y,x)$ with side length $r/2$ and a vertex at $y$ and
if $y \in Y \setminus A_0^m$, then $C(r,y,x) \cap (Y \setminus A_0^m)$ contains a hypercube $C_1(r,y,x)$
with side $r/2$ and a vertex at $y$.

Let $B_i^m$, $i = 1,\ldots,N(m)$, be equal size half-open half-closed hypercubes
forming a partition of $X = [0,1]^{d_x}$. The partition becomes finer as $m$ increases,
$\lambda(B_i^m) = N(m)^{-1} \to 0$. Let $x_i^m$ denote the center of $B_i^m$. Before
looking at logit let us consider an "infeasible" model $\mathcal{M}_2$,

$$p(y|x,\mathcal{M}_2) = \sum_{i=1}^{N(m)} \Bigl[\sum_{j=1}^{m} \alpha_{ji}^m(x,\mathcal{M}_2)\,\phi(y,\mu_j^m,\sigma_m) + \alpha_{0i}^m(x,\mathcal{M}_2)\,\phi(y,0,\sigma_0)\Bigr],$$

where the mixing probabilities $\alpha_{ji}^m(x,\mathcal{M}_2) = 1_{B_i^m}(x)F(A_j^m|x_i^m)$. As the partition
of $X$ becomes finer, model $\mathcal{M}_2$ approximates $\mathcal{M}_0$ because $F(A_j^m|x) \approx
\sum_{i=1}^{N(m)} 1_{B_i^m}(x)F(A_j^m|x_i^m)$ under continuity of $f(y|x)$ in $x$ (part 2 of Assumption
4.1). Since $\mathcal{M}_2$ is not interesting on its own I do not make this argument
precise here. Instead I employ this idea to get approximation results for
model $\mathcal{M}_3$ constructed similarly to $\mathcal{M}_2$ but with logit mixing probabilities,

(4.2) $\alpha_{ji}^m(x,\mathcal{M}_3) = \frac{\exp\{\log F(A_j^m|x_i^m) - R_m(x_i^{m\prime}x_i^m - 2x_i^{m\prime}x)\}}{\sum_{k,l}\exp\{\log F(A_k^m|x_l^m) - R_m(x_l^{m\prime}x_l^m - 2x_l^{m\prime}x)\}} = F(A_j^m|x_i^m)\,\frac{\exp\{-R_m(x_i^{m\prime}x_i^m - 2x_i^{m\prime}x)\}}{\sum_{l}\exp\{-R_m(x_l^{m\prime}x_l^m - 2x_l^{m\prime}x)\}}.$

In this expression, $R_m$ is a positive sequence diverging to infinity that satisfies
the following condition:

(4.3) $\exp\{-R_m s_m\}/s_m^{d_x/2} \to 0$, where $s_m = d_x\lambda(B_i^m)^{2/d_x} \to 0$

is the squared diagonal of $B_i^m$. This condition specifies that $R_m$ should
increase fast relative to how fine the partition of $X$ becomes. It is always
possible to define sequence $R_m$ satisfying (4.3), for example, $R_m = s_m^{-2}$.
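
The role of $R_m$ can be seen in a small numerical sketch for $d_x = 1$. Assuming the linear index in (4.2) is $-R_m(x_i^{m\prime}x_i^m - 2x_i^{m\prime}x)$, which equals $-R_m\|x - x_i^m\|^2$ plus a term common to all cells, the logit weight of the cell containing $x$ approaches one as $R_m$ grows. The grid of four cells and the evaluation point $x = 0.3$ are hypothetical choices for illustration.

```python
import math

def cell_centers(n_cells):
    """Centers of n_cells equal-size cells partitioning [0, 1]."""
    return [(i + 0.5) / n_cells for i in range(n_cells)]

def linear_index_logit(x, centers, r_m):
    """Logit weights with indices -r_m * (c*c - 2*c*x), which are linear in x."""
    indices = [-r_m * (c * c - 2.0 * c * x) for c in centers]
    top = max(indices)  # subtract the maximum for numerical stability
    expd = [math.exp(v - top) for v in indices]
    total = sum(expd)
    return [e / total for e in expd]

n_cells = 4
centers = cell_centers(n_cells)
s_m = 1.0 / n_cells ** 2      # squared diagonal of a cell when d_x = 1
r_m = s_m ** -2               # the example choice R_m = s_m^{-2} from the text
x = 0.3                       # lies in the second cell [0.25, 0.5)

print(linear_index_logit(x, centers, 1.0))   # small R: weights are spread out
print(linear_index_logit(x, centers, r_m))   # large R: close to an indicator
```

With $R_m = s_m^{-2} = 256$ the second cell receives almost all of the weight, while with $R_m = 1$ the weights are nearly uniform. Points exactly on a cell boundary keep equal weights on the two neighboring cells for any $R_m$, so the indicator behavior is only approximate there.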

Proposition 4.1. If condition (4.3), Assumption 4.1, and conditions
of Proposition 2.1 hold then $d_{KL}(F,\mathcal{M}_3) \to 0$ as $m \to \infty$.

The proposition is proved in the Appendix. The proof shows that the
expression in (4.2) multiplying $F(A_j^m|x_i^m)$ behaves like $1_{B_i^m}(x)$ when $R_m$
becomes large and then uses the same arguments as in the proof of Proposition
2.1. Attempts to develop similar results for mixing probabilities modeled
by multinomial probit [see, e.g., Geweke and Keane (2007) for applications]
were not successful. It would not be hard to make multinomial probit mixing
probabilities behave like indicator functions. However, making them behave
like an indicator times $F(A_j^m|x_i^m)$ as in (4.2) seems to be more difficult.

The bounds on the approximation error for $\mathcal{M}_3$ and $f(y|x)$ positive everywhere are similar to bounds for $\mathcal{M}_0$ obtained in Corollary 2.1. This is formalized in the following corollary.



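Corollary 4.1. Part (i). Suppose conditions of Proposition 4.1 hold, $f(y|x)$ is positive for any $y \in Y = R^d$ and any $x \in X$, $f(y|x)$ is continuously differentiable in $(y,x)$, and instead of (4.1) the following condition holds:

\[
\int \sup_{z \in C_r(y),\, \|x-t\| \le r} \left\| \frac{d\log f(z|t)}{d(z,t)} \right\| F(dy,dx) < \infty; \tag{4.4}
\]

then, for all sufficiently large $m$,

\begin{align}
d_{KL}(F,\mathcal{M}_3) \le{}& (\delta_m d^{1/2} + s_m^{1/2}) \int \sup_{z \in C_{\delta_m}(y),\, \|x-t\| \le s_m^{1/2}} \left\| \frac{d\log f(z|t)}{d(z,t)} \right\| F(dy,dx) \tag{4.5}\\
& + 2\,\frac{3d^{3/2}\delta_m^{d-1}h_m}{(2\pi)^{d/2}\sigma_m^d} + 2\exp\left\{-\frac{(\delta_m/\sigma_m)^2}{8}\right\} \tag{4.6}\\
& + r d^{1/2} \int_{B_{\delta_m}(A_0^m)} \sup_{z \in C_r(y),\, \|x-t\| \le r} \left\| \frac{d\log f(z|t)}{d(z,t)} \right\| F(dy,dx) \tag{4.7}\\
& + \int_{B_{\delta_m}(A_0^m)} \left[ \frac{y'y}{2\sigma_0^2} - \log\frac{(r/2)^d}{(2\pi\sigma_0^2)^{d/2}} \right] F(dy,dx) \tag{4.8}\\
& + \log\left[1 - d_x^{d_x/2}\exp\{-R_m s_m\}/s_m^{d_x/2}\right], \tag{4.9}
\end{align}

and bounds in (4.5)-(4.9) converge to zero as $m \to \infty$.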

Part (ii). If, in addition to assumptions from part (i), for some $q > 2$ and some $i_1 \in \{1,\ldots,d\}$,

\[
\int |y_i|^q F(dy) < \infty, \quad i \in \{1,\ldots,d\}, \tag{4.10}
\]

and

\[
\int |y_{i_1}|^{q-2} \sup_{z \in C_r(y),\, \|x-t\| \le r} \left\| \frac{d\log f(z|t)}{d(z,t)} \right\| F(dy,dx) < \infty, \tag{4.11}
\]

then the approximation error bound can be written as

\[
d_{KL}(F,\mathcal{M}_3) \le \text{constant} \cdot [mN(m)]^{-1/(d_x + d[2+1/(q-2)+\varepsilon])}, \tag{4.12}
\]

where $mN(m)+1$ is the number of mixture components in $\mathcal{M}_3$ and $\varepsilon > 0$ can be arbitrarily close to zero.

From the definition of models $\mathcal{M}_1$ and $\mathcal{M}_3$ and from the comparison of the convergence rates in (2.18) and (4.12), it is clear that using only linear indices in $x$ in the mixing probabilities does not come without a cost. The number of mixing components in model $\mathcal{M}_3$ that approximates an infeasible model $\mathcal{M}_0$ is equal to $mN(m)+1$, while for the model with polynomial terms in logit, $\mathcal{M}_1$, this number is $m+1$ (Corollary 3.1). The proof of Corollary 4.1 implies that the number of hypercubes in the partition of $X$, $N(m)$, increases exponentially with the dimensionality of $X$. Thus, the number of parameters in model $\mathcal{M}_3$ grows exponentially in the dimension of $x$ (the exponential growth of the number of parameters in $\mathcal{M}_1$ is discussed at the end of the previous section). Overall, approximation results for $\mathcal{M}_1$ and $\mathcal{M}_3$ do not seem to suggest which model might perform better in practice; however, they seem to identify a tradeoff between the number of components in the mixture and the flexibility of models for the mixing probabilities.
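
The growth of the component count can be illustrated with a few lines of arithmetic. The sketch below assumes, purely for illustration, that $X = [0,1]^{d_x}$ is cut into hypercubes of a common edge length, so that $N(m) = d_x^{d_x/2}s_m^{-d_x/2}$ with $s_m$ the squared cell diagonal, as in the proof of Corollary 4.1. The function name `n_cells`, the edge length 0.1 and the value of `m` are assumptions.

```python
def n_cells(side, d_x):
    """N(m) = d_x^{d_x/2} * s_m^{-d_x/2} with s_m = d_x * side**2.

    Assumes X = [0, 1]^{d_x} is cut into hypercubes of edge `side`;
    the function name and arguments are hypothetical.
    """
    s_m = d_x * side**2                      # squared diagonal of one cell
    return round((d_x / s_m) ** (d_x / 2))   # equals side**(-d_x) up to rounding

m = 20  # assumed number of fine partition elements of Y
for d_x in (1, 2, 3, 5):
    N = n_cells(0.1, d_x)
    # Components: m N(m) + 1 in M_3, compared with m + 1 in M_1.
    print(d_x, N, m * N + 1, m + 1)
```

With the edge length held fixed, the count for $\mathcal{M}_3$ is multiplied by ten for every additional covariate, while the count for $\mathcal{M}_1$ does not change with $d_x$.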

5. Flexible means and variances. In this section, I show that a finite 
mixture of normal regressions models, in which mixing probabilities do not 
depend on x, can be quite flexible. However, the results also suggest that 
specifications in which mixing probabilities are flexible functions of x might 
perform better. 

There is a large literature on finite mixture of regressions models. In 
early work, mixtures of two normal regressions were considered [see, e.g., 
Quandt and Ramsey (1978) and Kiefer (1978)]. Jones and McLachlan (1992)
applied the EM algorithm for estimation of finite mixtures of normal regres- 
sions. Fitting of more general finite mixtures of generalized linear models 
has been considered in Jansen (1993) and Wedel and DeSarbo (1995) among 
others. Many more references can be found in a comprehensive book on finite 
mixture models by McLachlan and Peel (2000). 

To the best of my knowledge, the literature on finite mixtures of regressions does not contain any approximation results for conditional densities. The closest analogs of the results I obtain can be found in the literature on finite mixtures of unconditional densities [see, e.g., Zeevi and Meir (1997) and references therein and Li and Barron (1999)]. Even for mixtures of unconditional densities, approximation results for the KL distance, which is useful for establishing consistency of Bayesian or classical maximum likelihood estimators, seem to be scarce. Approximation results in the KL distance for convex combinations of densities in Zeevi and Meir (1997) and Li and Barron (1999) seem to apply to mixtures of truncated normals and to target densities that are compactly supported. Some of these results are very strong. For example, for target densities that are general mixtures of the densities mixed in the model, approximation error bounds obtained by Li and Barron (1999) are proportional to $m^{-1}$. If there are no covariates $x$, then the infeasible model from Section 2 is simply a finite mixture of multivariate normals. For an elaboration on this idea in the context of joint and conditional density estimation and for consistency results for a Bayesian estimator based on this model see Norets and Pelenis (2009). The convergence rates obtained for this model in Section 2.1 are slower than $m^{-1}$. However, the convergence rates are not directly comparable as the target densities in Li and Barron (1999) are different from those considered here.

Model $\mathcal{M}_4$ constructed in this section is very similar to model $\mathcal{M}_0$ except for one important difference. In $\mathcal{M}_4$, fine equal probability partitions of $Y$ are used instead of the fine equal length partitions in $\mathcal{M}_0$. As will be clear below, $\mathcal{M}_4$ defined in this way allows mixing probabilities to be independent of $x$. However, it requires the means of the mixed normals to be flexible functions of $x$. In this section, I assume that the response variable is univariate: $Y \subset R$ or $d = 1$ (all the results from previous sections were obtained for arbitrary $d$). If fine equal probability partitions can be well defined for distributions of multivariate random variables and if these partitions depend smoothly on covariates, then it might be possible to extend the results of this section to multivariate responses. I do not pursue this conjecture here.

Define model $\mathcal{M}_4$ as follows:

\[
p(y|x,\mathcal{M}_4) = \sum_{j=0}^{m} \alpha_j^m\,\phi(y,\mu_j^m(x),\sigma_j^m(x)).
\]

For a given $x$ let $A_j^m(x)$, $j = 0,1,\ldots,m$, be a partition of $Y$ such that $\bigcup_{j=1}^{m} A_j^m(x)$ is a nondecreasing interval and

\[
F(A_j^m(x)|x) = p_m, \quad j > 0, \qquad F(A_0^m(x)|x) = 1 - mp_m \quad \text{and} \quad mp_m \to 1, \tag{5.1}
\]

for some $p_m \in (0, m^{-1}]$ that does not depend on $x$. Define an upper bound on the length of an element of the fine part of the partition, $h_m(x) \ge \max_{j>0} \lambda(A_j^m(x))$. The candidate mixing probabilities are given by $\alpha_j^m = F(A_j^m(x)|x)$ and $\mu_j^m(x) \in A_j^m(x)$. The standard deviations $\sigma_j^m(x) = \sigma_m(x)$ for $j > 0$ and $\sigma_0^m(x) = \sigma_0(x)$ are treated as functions of $x$, which is not essential but weakens the restrictions on $F$ (Corollaries 5.1 and 5.2 and Examples 5.1 and 5.2 below illustrate this point). Note that $\mathcal{M}_4$ is an infeasible model; in Corollary 5.2 below, I consider a feasible model $\mathcal{M}_5$ in which $\mu_j^m(x)$ are approximated by polynomials (see also Examples 5.1 and 5.2). Suppose sequences $\delta_m(x)$, $\sigma_m(x)$, and $h_m(x)$ satisfy

\[
\delta_m(x) \to 0, \qquad \frac{\sigma_m(x)}{\delta_m(x)} \to 0, \qquad \frac{h_m(x)}{\sigma_m(x)} \to 0. \tag{5.2}
\]

Next, let us introduce the following restrictions on F. 

Assumption 5.1. 1. Partitions $A_j^m(x)$ used in construction of $p(y|x,\mathcal{M}_4)$ satisfy (5.1), and (5.2) holds.

2. $f(y|x)$ is continuous in $y$ a.s. $F$.

3. For any $(y,x)$ there exists an interval $C(r(x),y,x)$ with length $r(x) > 0$ and $y \in C(r(x),y,x)$ such that (i)

\[
\int \log\frac{f(y|x)}{\inf_{z \in C(r(x),y,x)} f(z|x)}\, F(dy,dx) < \infty \tag{5.3}
\]

and (ii) there exists $M$ such that for any $m \ge M$, if $y \in A_0^m(x)$, then $C(r(x),y,x) \cap A_0^m(x)$ contains an interval $C_0(r(x),y,x)$ with an end at $y$ and length $r(x)/2$, and if $y \in Y \setminus A_0^m(x)$, then $C(r(x),y,x) \cap (Y \setminus A_0^m(x))$ contains an interval $C_1(r(x),y,x)$ with an end at $y$ and length $r(x)/2$.

4. $h_m(x)$, $\sigma_m(x)$, and $r(x)$ satisfy

\[
\sup_x \frac{\sigma_m(x)}{r(x)} \to 0, \qquad \sup_x \frac{h_m(x)}{\sigma_m(x)} \to 0. \tag{5.4}
\]

5. $\sigma_0(x)$ and $r(x)$ satisfy

\[
1 > 1/4 > \phi(y,0,\sigma_0(x))\,r(x)/2, \tag{5.5}
\]

which holds, for example, when $\sigma_0(x) \ge 2(2\pi)^{-1/2} \cdot r(x)$.

6. $\left|\int \log[\phi(y,0,\sigma_0(x))\,r(x)/2]\, F(dy,dx)\right| < \infty$.

Proposition 5.1. If Assumption 5.1 holds then $d_{KL}(F,\mathcal{M}_4) \to 0$ as $m \to \infty$.



The proposition is proved in the Appendix. The assumptions of the proposition and their role in the proof are similar to those discussed in detail in Section 2 for $\mathcal{M}_0$. The assumptions are satisfied by a large class of densities as illustrated by the following corollaries and examples. Approximation error bounds for $\mathcal{M}_4$ are presented below in Corollary 5.3.

Fig. 2. Approximation of densities with bounded support by $\mathcal{M}_4$. [Horizontal axis: $y$, with the points $a(x)$, $a_1(x)$, $b_1(x)$, $b(x)$ marked.]



Corollary 5.1. Assume:

1. $f(y|x)$ is continuous in $y$ in the interior of the support of $f(y|x)$ for all $x \in X$.

2. There exists $\bar{f} < \infty$ such that $f(y|x) \le \bar{f}$ for all $(y,x)$.

3. The support of $f(\cdot|x)$ is given by a finite interval $[a(x),b(x)]$, where $a(x)$ and $b(x)$ are square integrable. Also, for some $\underline{f} \in (0,1)$, a positive integer $n$, and $a(x) < a_1(x) < b_1(x) < b(x)$, $f(y|x) \ge \underline{f}$ on $[a_1(x),b_1(x)]$, $f(y|x) \ge \underline{f} \cdot [y-a(x)]^n$ on $(a(x),a_1(x))$, and $f(y|x) \ge \underline{f} \cdot [b(x)-y]^n$ on $(b_1(x),b(x))$. Figure 2 provides an illustration for $n = 1$.

4. There exists $r > 0$ such that $f(\cdot|x)$ is nondecreasing on $(a(x), a_1(x)+r/2)$ and nonincreasing on $(b_1(x)-r/2, b(x))$ for all $x \in X$.

Then for $\mathcal{M}_4$ constructed so that $p_m = 1/m$, $A_0^m = \emptyset$, $\mu_j^m(x) \in A_j^m(x)$, and $\sigma_m(x) = p_m^{1/[4(n+1)]}$ and $\sigma_0(x) = 2(2\pi)^{-1/2} \cdot r$ are independent of $x$, $d_{KL}(F,\mathcal{M}_4) \to 0$.

Corollary 5.2. Assume conditions from Corollary 5.1, $F^{-1}(p|x)$ is continuous in $x$ for all $p \in [0,1]$, and $X$ is compact. Then there exists a sequence of polynomials $P_j^m(x)$ such that $d_{KL}(F,\mathcal{M}_5) \to 0$, where

\[
p(y|x,\mathcal{M}_5) = \sum_{j=1}^{m} p_m\,\phi(y, P_j^m(x), p_m^{1/8}).
\]

Proof. Let $\mu_j^m(x) = F^{-1}((j-1/2)p_m|x)$. Note that $\mu_j^m(x) \in A_j^m(x) = [F^{-1}((j-1)p_m|x), F^{-1}(jp_m|x)]$ and

\[
p_m/2 = \int_{\mu_j^m(x)}^{F^{-1}(jp_m|x)} f(y|x)\,dy \le (F^{-1}(jp_m|x) - \mu_j^m(x))\,\bar{f}.
\]

Similarly, $p_m/2 \le (\mu_j^m(x) - F^{-1}((j-1)p_m|x))\,\bar{f}$. Thus, for $\varepsilon_m = p_m/(2\bar{f})$, $(\mu_j^m(x)-\varepsilon_m, \mu_j^m(x)+\varepsilon_m) \subset A_j^m(x)$. By the Stone-Weierstrass theorem there exist finite order polynomials in $x$, $P_j^m(x)$, such that $|P_j^m(x) - \mu_j^m(x)| < \varepsilon_m$. Therefore, $P_j^m(x) \in A_j^m(x)$, which was the only requirement on the means of the mixture components in Corollary 5.1. □

Example 5.1. Exponential distribution, $f(y|x) = \gamma(x)\exp\{-\gamma(x)y\}$, $\gamma(x) \ge \underline{\gamma} > 0$, $\gamma(x)$ is continuous, $\int \gamma\,dF < \infty$ and the second moment of $y$ is finite ($\int \gamma^{-2}\,dF < \infty$). The quantile function is given by $F^{-1}(p|x) = -\gamma(x)^{-1}\log(1-p)$. Let the partition be such that $A_0^m = [F^{-1}(mp_m|x), \infty)$. Since the exponential density is decreasing, the largest interval in the fine part of the partition is given by $A_m^m = [F^{-1}((m-1)p_m|x), F^{-1}(mp_m|x))$. Therefore, $h_m(x) = h_m = \underline{\gamma}^{-1}\log(1 + p_m/(1-p_m m))$. Choosing $p_m = (m - m^{0.5})/m^2$ guarantees that $h_m \to 0$. For $\sigma_m = h_m^{1/4}$, $\delta_m(x) = h_m^{1/8}$, and $r(x) = 1$, conditions (5.1), (5.2) and (5.4) hold.
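
The sequences chosen in Example 5.1 can be checked numerically. The sketch below is only an illustration of the stated choices $p_m = (m - m^{0.5})/m^2$, $\sigma_m = h_m^{1/4}$ and $\delta_m = h_m^{1/8}$; the value assigned to the lower bound on $\gamma(x)$ (`gamma_lower`) and the function names are assumptions.

```python
import math

def p_m(m):
    """p_m = (m - m**0.5) / m**2, the choice made in Example 5.1."""
    return (m - math.sqrt(m)) / m**2

def h_m(m, gamma_lower=1.0):
    """h_m = log(1 + p_m / (1 - p_m m)) / gamma_lower.

    `gamma_lower` stands for the lower bound on gamma(x); its value here
    is an arbitrary assumption for the illustration.
    """
    p = p_m(m)
    return math.log1p(p / (1.0 - p * m)) / gamma_lower

for m in (10, 1_000, 100_000):
    h = h_m(m)
    sigma, delta = h ** 0.25, h ** 0.125
    # Columns: m p_m (-> 1), h_m (-> 0), sigma_m/delta_m (-> 0), h_m/sigma_m (-> 0)
    print(m, m * p_m(m), h, sigma / delta, h / sigma)
```

With this $p_m$, $1 - mp_m = m^{-1/2}$, so $p_m/(1 - p_m m) = m^{-1/2} - m^{-1}$, which is why $h_m$ vanishes while $mp_m$ still tends to one.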

Next, let $C(1,y,x) = [y, y+1]$ if $y \in [0,1/2]$, $C(1,y,x) = [y-1/2, y+1/2]$ if $y \in [1/2,\infty)$. Since

\[
\inf_{z \in C(1,y,x)} f(z|x) \ge \gamma(x)\exp\{-\gamma(x)(y+1)\},
\]

we have

\[
1 \le f(y|x) \Big/ \inf_{z \in C(1,y,x)} f(z|x) \le \exp\{\gamma(x)\}.
\]

Inequality (5.3) is satisfied since $\gamma(x)$ is assumed to be integrable. Finally, let $\sigma_0(x) = 2(2\pi)^{-1/2}$ so that inequality (5.5) in Assumption 5.1 holds. Then,

\[
\left| \int \log[\phi(y,0,\sigma_0(x))\,r(x)/2]\, F(dy,dx) \right| = \left| \int \left[ -\log(4) - \frac{\pi y^2}{4} \right] F(dy,dx) \right| < \infty
\]

since the second moment of $y$ is assumed to be finite. Thus, condition 6 of Assumption 5.1 holds.

If $X$ is compact the same argument as in the proof of Corollary 5.2 can be used to show that $\mu_j^m(x)$ can be polynomial in $x$ [for fixed $m$ there exists $\varepsilon_m > 0$ such that $\lambda(A_j^m(x)) > \varepsilon_m$ for all $x$ and $j$].

It is possible to give sufficient conditions for approximation results when $\gamma(x)$ is not bounded away from zero, for example, let $r(x) = \gamma(x)^{-1}$, $h_m(x) = \gamma(x)^{-1}\log(1 + p_m/(1-p_m m))$, etc. However, then $\sigma_m$ and $\sigma_0$ would have to be functions of $x$ [not necessarily flexible functions of $x$ but functions that would have the same order as $\gamma(x)$]. Also, $\gamma(x)^{-1}$ is not continuous and the argument I use for justifying the use of polynomial $\mu_j^m(x)$ breaks down in this case.

Example 5.2. Uniform distribution, $f(y|x) = b(x)^{-1}1_{[0,b(x)]}(y)$, $b(x) > 0$ is continuous, $\int \log b\,dF < \infty$ and the second moment of $y$ is finite ($\int b^2\,dF < \infty$). This example demonstrates that the support of $f(y|x)$ does not have to be (un)bounded uniformly in $x$ as long as normal variances are modeled as flexible functions of $x$.

Let the partition be such that $A_0^m = \emptyset$ and $p_m = F(A_j^m|x) = m^{-1}$, $j > 0$. Note that $h_m(x) = b(x)/m$. For $\sigma_m(x) = b(x)p_m^{1/4}$, $\delta_m(x) = b(x)p_m^{1/8}$, and $r(x) = b(x)$, conditions (5.1), (5.2) and (5.4) hold.

Next, let $C(r(x),y,x) = [0,b(x)]$. Note that $f(y|x)/\inf_{z \in C(r(x),y,x)} f(z|x) = 1$, and inequality (5.3) is satisfied. Finally, let $\sigma_0(x) = 2(2\pi)^{-1/2}b(x)$ so that inequality (5.5) in Assumption 5.1 holds. Then,

\[
\left| \int \log[\phi(y,0,\sigma_0(x))\,r(x)/2]\, F(dy,dx) \right| = \left| -\log(4) - \pi/(3 \cdot 4) \right| < \infty
\]

and condition 6 of Assumption 5.1 holds.

If $X$ is compact and $b(x)$ is bounded away from zero then the same argument as in the proof of Corollary 5.2 can be used to show that $\mu_j^m(x)$ can be polynomial in $x$ [for fixed $m$ there exists $\varepsilon_m > 0$ such that $\lambda(A_j^m(x)) > \varepsilon_m$ for all $x$ and $j$].
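
The construction in Example 5.2 is simple enough to evaluate directly. The following sketch builds $p(y|x,\mathcal{M}_4)$ for the uniform target at a fixed $x$ and approximates the KL distance by a midpoint rule. It assumes the choices stated in the example ($p_m = 1/m$, $A_0^m$ empty, $\sigma_m(x) = b(x)p_m^{1/4}$) and, as one admissible choice of $\mu_j^m(x) \in A_j^m(x)$, the cell midpoints; the function names and the grid size are hypothetical.

```python
import numpy as np

def normal_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def m4_density_uniform(y, b, m):
    """p(y|x, M_4) for the uniform target on [0, b], with b = b(x).

    Assumed choices: p_m = 1/m, A_0^m empty, means at the conditional
    quantiles F^{-1}((j - 1/2) p_m | x) = (j - 1/2) b / m, and
    sigma_m(x) = b * p_m**(1/4). Function names are hypothetical.
    """
    p_m = 1.0 / m
    mu = (np.arange(1, m + 1) - 0.5) * b * p_m      # one mean per cell A_j^m(x)
    sigma = b * p_m ** 0.25
    # Mixing probabilities are the constant p_m: they do not depend on x.
    return p_m * normal_pdf(y[:, None], mu[None, :], sigma).sum(axis=1)

def kl_uniform_vs_m4(b, m, n_grid=2000):
    """Midpoint-rule approximation of the KL distance from f(.|x) to M_4."""
    y = (np.arange(n_grid) + 0.5) * b / n_grid
    return float(np.mean(np.log((1.0 / b) / m4_density_uniform(y, b, m))))

for m in (4, 64, 1024):
    print(m, kl_uniform_vs_m4(b=2.0, m=m))
```

Because $\sigma_m(x)$ and the means scale with $b(x)$, the computed distance does not depend on the value of `b`, which is one way to see why letting the variances depend on $x$ accommodates supports that are not bounded uniformly in $x$.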

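Corollary 5.3. Suppose conditions of Proposition 5.1 are satisfied for $h_m(x) = h_m$, $\sigma_m(x) = \sigma_m$, $\delta_m(x) = \delta_m$ and $r(x) = r$ that do not depend on $x$. Also, suppose conditions from parts (i) and (ii) of Corollary 2.1 hold. Then for all sufficiently large $m$,

\begin{align}
d_{KL}(F,\mathcal{M}_4) \le{}& \delta_m d^{1/2} \int \sup_{z \in C_{\delta_m}(y)} \left| \frac{d\log f(z|x)}{dz} \right| F(dy,dx) \tag{5.6}\\
& + 2\,\frac{3h_m}{(2\pi)^{1/2}\sigma_m} + 2\exp\left\{-\frac{(\delta_m/\sigma_m)^2}{8}\right\} \tag{5.7}\\
& + r \int_{B_{\delta_m}(A_0^m(x))} \sup_{z \in C_r(y)} \left| \frac{d\log f(z|x)}{dz} \right| F(dy,dx) \tag{5.8}\\
& + \int_{B_{\delta_m}(A_0^m(x))} \left[ \frac{y'y}{2\sigma_0^2} - \log\frac{(r/2)}{(2\pi\sigma_0^2)^{1/2}} \right] F(dy,dx), \tag{5.9}
\end{align}

where $B_{\delta_m}(A_0^m(x)) = \{(y,x) : C_{\delta_m}(y) \cap A_0^m(x) \ne \emptyset\}$ and bounds in (5.6)-(5.9) converge to zero as $m \to \infty$.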



Proof. The proof is identical to the proof of Corollary 2.1. □ 

The bounds for $\mathcal{M}_4$, (5.6)-(5.9), are almost the same as the bounds for $\mathcal{M}_0$, (2.12)-(2.15), obtained in Corollary 2.1, except for a difference between $B_{\delta_m}(A_0^m(x))$ in $\mathcal{M}_4$ and $B_{\delta_m}(A_0^m)$ in $\mathcal{M}_0$. For the same value of $h_m$, the length of the complement of $A_0^m(x)$ in $\mathcal{M}_4$ is bounded above by $mh_m$ [$h_m = \max_{j>0}\lambda(A_j^m(x))$], which is the length of the complement of $A_0^m$ in $\mathcal{M}_0$. Thus the bounds obtained for $\mathcal{M}_4$ are likely to be larger than the bounds obtained for $\mathcal{M}_0$. Compact and interpretable conditions sufficient for deriving an explicit approximation rate for $\mathcal{M}_4$ from (5.6)-(5.9) seem to be difficult to find. Instead, I show in the following example that not only can the bounds for $\mathcal{M}_0$ be smaller but also that convergence for $\mathcal{M}_0$ can be slightly faster than for $\mathcal{M}_4$.

Example 5.3. Laplace distribution, $f(y|x) = 0.5\gamma(x)\exp\{-\gamma(x)|y|\}$, $\gamma(x) \ge \underline{\gamma} > 0$, $\gamma(x)$ is continuous, $\int \gamma\,dF < \infty$ and the second moment of $y$ is finite ($\int \gamma^{-2}\,dF < \infty$). Note that nondifferentiability of $f(y|x)$ at zero does not affect any of the theoretical results above.

First consider $\mathcal{M}_4$. Let $A_j^m(x) = [F^{-1}((1-p_m m)/2 + (j-1)p_m|x), F^{-1}((1-p_m m)/2 + jp_m|x))$. Note that $F^{-1}(p|x) = \log(2p)/\gamma(x)$ for $p < 0.5$ and $F^{-1}(p|x) = -\log(2(1-p))/\gamma(x)$ for $p > 0.5$. Then,

\[
h_m \ge F^{-1}((1-p_m m)/2 + p_m|x) - F^{-1}((1-p_m m)/2|x) = \frac{1}{\gamma(x)}\log\left(1 + \frac{2p_m}{1-p_m m}\right). \tag{5.10}
\]

Since $h_m \to 0$ and $mp_m \to 1$ we can write

\[
p_m = \frac{1}{m + g(m)},
\]

where $g(m)$ satisfies $g(m)/m \to 0$ and $g(m) \to \infty$. Note that

\[
B_{\delta_m}(A_0^m(x)) \subset \left\{ (y,x) : y \in \left(-\frac{\log(1-p_m m)(1-\varepsilon_0)}{\gamma(x)}, \infty\right) \cup \left(-\infty, \frac{\log(1-p_m m)(1-\varepsilon_0)}{\gamma(x)}\right) \right\}
\]

for any $\varepsilon_0 \in (0,1)$ and all sufficiently large $m$. A direct calculation shows that integrals in (5.8) and (5.9) can be bounded by

\[
\text{constant} \cdot (1-p_m m)^{1-\varepsilon} \le \text{constant} \cdot (g(m)/m)^{1-\varepsilon}
\]

for any $\varepsilon \in (\varepsilon_0,1)$ and all sufficiently large $m$. From (5.10) and the mean value theorem,

\[
h_m \ge \text{constant} \cdot \gamma(x)^{-1} \cdot g(m)^{-1}.
\]

Since the approximation error bounds increase in $h_m$, we should choose the smallest possible value for $h_m = \text{constant} \cdot \underline{\gamma}^{-1} \cdot g(m)^{-1}$. One can verify that the smallest upper bound for $\delta_m$, $h_m/\sigma_m$, $\exp\{-(\delta_m/\sigma_m)^2/8\}$ and $(g(m)/m)^{1-\varepsilon}$ is inside the interval $(m^{-1/3}, m^{-1/[3+\varepsilon_1]}]$ for any $\varepsilon_1 > 0$ and all sufficiently large $m$. Thus,

\[
d_{KL}(F,\mathcal{M}_4) \le \text{constant} \cdot \left(\frac{1}{m}\right)^{1/[3+\varepsilon_1]}.
\]

Next, consider $\mathcal{M}_0$. Expressions (2.14) and (2.15) are exponentially decreasing in $h_m m$. Setting $h_m$ to a power of $m$, one can show that

\[
d_{KL}(F,\mathcal{M}_0) \le \text{constant} \cdot \left(\frac{1}{m}\right)^{1/[2+\varepsilon_2]},
\]

for any $\varepsilon_2 > 0$ and all sufficiently large $m$. These results suggest that $\mathcal{M}_0$ converges to the target density faster than $\mathcal{M}_4$.
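
The quantile difference used in (5.10) can be verified numerically from the Laplace quantile function stated in Example 5.3. In the sketch below the closed form $\gamma(x)^{-1}\log(1 + 2p_m/(1-p_m m))$ is what the stated quantile function yields for the outermost fine cell; the function names and the values of `m`, `g` and `gamma` are assumptions made for the illustration.

```python
import math

def laplace_quantile(p, gamma):
    """F^{-1}(p|x) for the Laplace density 0.5*gamma*exp(-gamma*|y|)."""
    if p < 0.5:
        return math.log(2.0 * p) / gamma
    return -math.log(2.0 * (1.0 - p)) / gamma

def outer_cell_length(m, g, gamma=1.0):
    """Length of the first fine cell when p_m = 1 / (m + g)."""
    p = 1.0 / (m + g)
    left = (1.0 - p * m) / 2.0          # probability mass to the left of the cell
    return laplace_quantile(left + p, gamma) - laplace_quantile(left, gamma)

m, g = 10_000, 100.0   # assumed values with g/m small and g large
p = 1.0 / (m + g)
closed_form = math.log1p(2.0 * p / (1.0 - p * m))   # gamma = 1
print(outer_cell_length(m, g), closed_form, 2.0 / g)
```

Since $2p_m/(1-p_m m) = 2/g(m)$ when $p_m = 1/(m+g(m))$, the cell length is of order $g(m)^{-1}$, which matches the lower bound on $h_m$ used in the example.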



It might be unfair to compare approximation errors for $\mathcal{M}_0$ and $\mathcal{M}_4$.
Although both models are "infeasible" and include m functions that need to 
be approximated by polynomials (or splines), the error from approximation 
by the polynomials enters the total approximation error in different ways. 
Nevertheless, the results obtained in this section do seem to suggest that 
models in which mixing probabilities depend on covariates might perform 
better in practice. 

6. Comparison with Jiang and Tanner (1999). Jiang and Tanner (1999) is the only work on approximation of conditional densities by ME that I am aware of. Jiang and Tanner (1999) develop approximation and estimation results for target densities of the form

\[
\pi(y|x; h(\cdot)) = \exp(a(h(x))y + b(h(x)) + c(y)). \tag{6.1}
\]

Functions $a$, $b$ and $c$ are assumed to be known, $a$ and $b$ are assumed to have nonzero derivatives and $h(x)$ is assumed to have uniformly bounded continuous second order derivatives. It seems that their results could still hold if $a$, $b$ and $c$ are known only up to some parameters (see their Remark 4). Jiang and Tanner (1999) show that $\pi(y|x;h(\cdot))$ can be approximated in the KL distance by ME of the form

\[
\sum_{j=1}^{m} \alpha_j^m(x)\,\pi(y|x; h_j(\cdot)), \tag{6.2}
\]

where $\pi(\cdot|\cdot;\cdot)$ is defined in (6.1), $h_j(x)$ is a linear function of $x$ and the mixing probabilities $\alpha_j^m(x)$ can be modeled by logit (more general specifications for mixing weights are also allowed). The idea of their argument is to divide $X$ into a fine partition $B_j^m$, approximate $1_{B_j^m}(x)$ by $\alpha_j^m(x)$ and approximate $h(x)$ by a linear function $h_j(x)$ on $B_j^m$. Jiang and Tanner (1999) prove that for their target class of densities a bound on the approximation error is proportional to $m^{-4/d_x}$.

There are several important differences between the present work and Jiang and Tanner (1999). First, I consider multivariate responses, $y$, while Jiang and Tanner (1999) consider univariate responses. Most importantly, I do not assume that the functional form of $f(y|x)$ is known, for example, known $\pi$, $a$, $b$ and $c$. The components of the model I employ, for example, normal densities and logit mixing probabilities, are generally not related to the true density. As Examples 2.2 and 2.3 and Corollary 5.1 illustrate, many densities that are not from (6.1) are shown to be approximable by ME models. Examples 2.1 and 5.1 also show that some of the densities from class (6.1) satisfy sufficient conditions for approximation results I obtain. However, there might exist densities from (6.1) that violate these sufficient conditions. This would not be surprising since the "correct" functional forms are mixed in (6.2). For the same reason it is not surprising that the approximation rate obtained by Jiang and Tanner (1999), $m^{-4/d_x}$, differs from the ones obtained here, for example, $m^{-1/[d_x+2+1/(q-2)+\varepsilon]}$ for model $\mathcal{M}_3$ in Corollary 4.1. Finally, responses in the Jiang and Tanner (1999) class (6.1) can be discrete, for example, Poisson. To accommodate discrete responses in the framework of the present paper one could map the discrete values of response $y$ into a partition of $R$ and introduce a corresponding latent variable $y^* \sim p(y^*|x,\mathcal{M})$. For example, for binary $y \in \{0,1\}$ let $y^* \in (-\infty,0)$ if $y = 0$ and $y^* \in [0,\infty)$ if $y = 1$. Any discrete distribution can be represented by a continuously distributed latent variable in this fashion. This continuous distribution can be flexibly modeled by $p(y^*|x,\mathcal{M})$. Models with latent variables are easy to estimate in the Bayesian framework using MCMC methods [see, e.g., Tanner and Wong (1987) and Albert and Chib (1993)].
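
For the binary case the latent-variable mapping implies a closed-form response probability once $y^*|x$ is a finite mixture of normals. The sketch below is only an illustration of that mapping, not an estimation procedure; the function names and the numerical values of the mixture parameters (assumed to be already evaluated at a fixed $x$) are hypothetical.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_y_equals_one(weights, means, sigmas):
    """P(y = 1 | x) = P(y* >= 0 | x) when y* | x is a finite normal mixture.

    `weights`, `means`, `sigmas` are the mixing probabilities, component
    means and standard deviations already evaluated at a given x; the
    function and argument names are hypothetical.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing probabilities must sum to one")
    # P(N(mu, sigma^2) >= 0) = Phi(mu / sigma) for each component.
    return sum(a * normal_cdf(mu / s) for a, mu, s in zip(weights, means, sigmas))

# Assumed two-component example at some fixed x.
print(prob_y_equals_one([0.3, 0.7], [-1.0, 0.5], [1.0, 2.0]))
```

With a single component this reduces to a probit-type probability, so the mixture can be read as a flexible generalization of that link.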

7. Discussion. This paper shows that large classes of conditional den- 
sities can be approximated in the Kullback-Leibler distance by different 
specifications of finite smooth mixtures of normal densities or regressions. 
The theory can be generalized to smooth mixtures of location scale densities. 
These results have interesting implications for applied researchers. 

First of all, smooth mixtures of densities or experts can be used as flexible 
models for estimation of multivariate conditional densities. It seems this 
issue has not been explored in the literature and it would be interesting to 
see how specifications studied in the paper work in these settings. 

Second, smooth mixtures of simple components, for example, models in 
which mixing probabilities are modeled by multinomial logit linear in co- 
variates and the means and variances do not depend on covariates, can be 
quite flexible. A simulation study in Villani, Kohn and Giordani (2009) sug- 
gests though that models with more complex components perform better in 
practice. This issue should be further explored in simulation studies. 

Third, results in Section 4 suggest that making mixing probabilities more flexible, for example, by using polynomials in logit, might reduce the number of necessary mixture components. However, these models are more difficult to estimate.

Fourth, models in which mixing probabilities do not depend on covariates can be very flexible, at least for univariate response variables. However, they seem to require a lot of mixture components and very flexible models for the means of the mixed normals. Also, approximation error bounds and convergence rates (Example 5.3) obtained in Section 5 suggest that models with flexible mixing probabilities might perform better in practice than models with flexible means of the mixed normals and constant mixing probabilities. Nevertheless, it would be interesting to see how these specifications perform in actual applications and simulation studies.

On the basis of a simulation study, Villani, Kohn and Giordani (2009) 
generally recommend using heteroscedastic experts (mixture components 
with variances that depend on covariates). The theory obtained here sug- 
gests that heteroscedastic experts might be necessary when differences in 
quantiles of f(-\x) are not uniformly bounded in x and, especially, when the 
support bounds of f(-\x) are increasing without a bound in x (see Examples 
2.4 and 5.2). This suggestion is likely to remain useful when the differences 
in quantiles and/or support of f(-\x), although bounded, still change con- 
siderably with covariates. 

Practical implications of the theoretical results obtained in the paper and 
summarized in this section are deduced under the assumption of no estima- 
tion and parameter uncertainty. Exploring the behavior of the estimation 
error in addition to the approximation error would result in a more com- 
plete understanding of the ME models. This issue is left for future work. 

Overall, the paper provides a number of encouraging approximation re- 
sults for (smooth) mixtures of densities or experts which might stimulate 
more theoretical and applied work in this area of research. 

APPENDIX 

Proof of Proposition 2.1. Since $d_{KL}$ is always nonnegative,

\[
0 \le d_{KL}(F,\mathcal{M}_0) = \int \log\frac{f(y|x)}{p(y|x,\mathcal{M}_0)}\, F(dy,dx) \le \int \log\max\left\{1, \frac{f(y|x)}{p(y|x,\mathcal{M}_0)}\right\} F(dy,dx).
\]

Thus, it suffices to show that the last integral in the inequality above converges to zero as $m$ increases. The dominated convergence theorem (DCT) is used for that. First, I establish conditions for point-wise convergence of the integrand to zero a.s. $F$. Then, I present conditions for existence of an integrable upper bound on the integrand required by the DCT.

For fixed $(y,x)$,

\begin{align}
p(y|x,\mathcal{M}_0) &= \sum_{j=1}^{m} F(A_j^m|x)\phi(y,\mu_j^m,\sigma_m) + F(A_0^m|x)\phi(y,0,\sigma_0) \notag\\
&\ge \inf_{z \in C_{\delta_m}(y)} f(z|x) \sum_{j: A_j^m \subset C_{\delta_m}(y)} \lambda(A_j^m)\phi(y,\mu_j^m,\sigma_m), \tag{A.1}
\end{align}

where $\lambda$ is the Lebesgue measure.

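In Lemmas A.1 and A.2, I derive the following bounds for the Riemann sum in (A.1) (the Riemann sum is not far from the corresponding normal integral, and the integral is not far from 1):

\begin{align}
\sum_{j: A_j^m \subset C_{\delta_m}(y)} \lambda(A_j^m)\phi(y,\mu_j^m,\sigma_m)
&\ge 1 - \frac{3d^{3/2}\delta_m^{d-1}h_m}{(2\pi)^{d/2}\sigma_m^d} - \frac{8d\sigma_m}{(2\pi)^{1/2}\delta_m}\exp\left\{-\frac{(\delta_m/\sigma_m)^2}{8}\right\} \notag\\
&\ge 1 - \frac{3d^{3/2}\delta_m^{d-1}h_m}{(2\pi)^{d/2}\sigma_m^d} - \exp\left\{-\frac{(\delta_m/\sigma_m)^2}{8}\right\}, \tag{A.2}
\end{align}

where the last inequality holds for all sufficiently large $m$ ($\delta_m/\sigma_m \to \infty$). Given $\varepsilon > 0$ there exists $M_1$ such that for $m \ge M_1$, expressions in (A.2) are bounded below by $(1-\varepsilon)$.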
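
The statement that the Riemann sum in (A.1) is close to 1 can be checked numerically in the univariate case. The sketch below does not reproduce Lemmas A.1 and A.2 (which are not shown here); it simply evaluates the sum under assumed choices: equal-length cells $A_j = [jh, (j+1)h)$, $\mu_j$ at the left end point of each cell, and $C_\delta(y)$ the interval centered at $y$ with length $\delta$. All names and numerical values are hypothetical.

```python
import math

def normal_pdf(t, mu, sigma):
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def riemann_sum_inside(y, delta, sigma, h):
    """Sum of lambda(A_j) * phi(y, mu_j, sigma) over cells A_j inside C_delta(y).

    Univariate sketch with assumed choices: cells A_j = [j h, (j + 1) h),
    mu_j taken as the left end point of A_j, and C_delta(y) the interval
    centered at y with length delta.
    """
    lo, hi = y - delta / 2.0, y + delta / 2.0
    j_first = math.ceil(lo / h)            # first cell with left end >= lo
    j_last = math.floor(hi / h) - 1        # last cell with right end <= hi
    return sum(h * normal_pdf(y, j * h, sigma) for j in range(j_first, j_last + 1))

# h much smaller than sigma, sigma much smaller than delta: the sum is near 1.
fine = riemann_sum_inside(y=0.3, delta=1.0, sigma=0.1, h=0.001)
# h larger than sigma: the sum is no longer a good approximation of 1.
coarse = riemann_sum_inside(y=0.3, delta=1.0, sigma=0.1, h=0.3)
print(fine, coarse)
```

The comparison of the two calls illustrates why the conditions $h_m/\sigma_m \to 0$ and $\sigma_m/\delta_m \to 0$ appear together: the first controls the discretization error, the second the normal mass left outside $C_{\delta_m}(y)$.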

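If $f(y|x)$ is continuous in $y$ at $(y,x)$ and $f(y|x) > 0$, there exists $M_2$ such that for $m \ge M_2$, $[f(y|x)/\inf_{z \in C_{\delta_m}(y)} f(z|x)] \le (1+\varepsilon)$ since $\delta_m \to 0$. For any $m \ge \max\{M_0, M_1, M_2\}$,

\[
1 \le \max\left\{1, \frac{f(y|x)}{p(y|x,\mathcal{M}_0)}\right\} \le \max\left\{1, \frac{f(y|x)}{\inf_{z \in C_{\delta_m}(y)} f(z|x)\,(1-\varepsilon)}\right\} \le \frac{1+\varepsilon}{1-\varepsilon}.
\]

Thus, $\log\max\{1, f(y|x)/p(y|x,\mathcal{M}_0)\} \to 0$ a.s. $F$ as long as $f(y|x)$ is continuous in $y$ a.s. $F$ [$f(y|x)$ is always positive a.s. $F$].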

Parts 2 and 3 of Assumption 2.1 are used for establishing an integrable upper bound for the DCT,

\begin{align}
p(y|x,\mathcal{M}_0) ={}& \sum_{j=1}^{m} F(A_j^m|x)\phi(y,\mu_j^m,\sigma_m) + F(A_0^m|x)\phi(y,0,\sigma_0) \notag\\
\ge{}& [1 - 1_{A_0^m}(y)] \cdot \inf_{z \in C_1(r,y,x)} f(z|x) \cdot \sum_{j: A_j^m \subset C_1(r,y,x)} \lambda(A_j^m)\phi(y,\mu_j^m,\sigma_m) \tag{A.3}\\
& + 1_{A_0^m}(y) \cdot \inf_{z \in C_0(r,y,x)} f(z|x) \cdot \lambda(C_0(r,y,x))\,\phi(y,0,\sigma_0). \notag
\end{align}

Lemmas A.1 and A.2 imply that the Riemann sum in (A.3) is bounded below by $2^{-d} - 2^{-(d+1)} = 2^{-(d+1)}$ for any $m$ larger than some $M_4$. Inequalities (A.3) and (2.5) imply

\begin{align}
\log\max\left\{1, \frac{f(y|x)}{p(y|x,\mathcal{M}_0)}\right\}
&\le \log\max\left\{1, \frac{f(y|x)}{\inf_{z \in C(r,y,x)} f(z|x) \cdot \phi(y,0,\sigma_0) \cdot (r/2)^d}\right\} \notag\\
&\le \log\left[\frac{1}{\phi(y,0,\sigma_0)(r/2)^d}\max\left\{\phi(y,0,\sigma_0)(r/2)^d, \frac{f(y|x)}{\inf_{z \in C(r,y,x)} f(z|x)}\right\}\right] \notag\\
&\le -\log(\phi(y,0,\sigma_0)(r/2)^d) + \log\frac{f(y|x)}{\inf_{z \in C(r,y,x)} f(z|x)}, \tag{A.4}
\end{align}

where inequality (A.4) follows by the first inequality in (2.5). The first expression in (A.4) is integrable by Assumption 2.1, part 2. The second expression in (A.4) is integrable by Assumption 2.1, part 3. Thus the proposition is proved. □

Proof of Corollary 2.1. The proof of the first part of the proposition is a simple implication of the argument in the proof of Proposition 2.1. Note that

\[
d_{KL}(F,\mathcal{M}_0) = \int_{Y \times X \setminus B_{\delta_m}(A_0^m)} \log\frac{f(y|x)}{p(y|x,\mathcal{M}_0)}\, F(dy,dx) + \int_{B_{\delta_m}(A_0^m)} \log\frac{f(y|x)}{p(y|x,\mathcal{M}_0)}\, F(dy,dx). \tag{A.5}
\]

For $(y,x) \in Y \times X \setminus B_{\delta_m}(A_0^m)$, inequalities (A.1) and (A.2) apply. Thus, the first integral in (A.5) is bounded by the sum of (2.7) and (2.8), where the bound in (2.8) is obtained by the mean value theorem for $-\log(1-x)$ and a small positive $x$,

\[
-\log\left[1 - \frac{3d^{3/2}\delta_m^{d-1}h_m}{(2\pi)^{d/2}\sigma_m^d} - \exp\left\{-\frac{(\delta_m/\sigma_m)^2}{8}\right\}\right] \le 2\left[\frac{3d^{3/2}\delta_m^{d-1}h_m}{(2\pi)^{d/2}\sigma_m^d} + \exp\left\{-\frac{(\delta_m/\sigma_m)^2}{8}\right\}\right]. \tag{A.6}
\]

By inequality (A.3), the second integral in (A.5) is bounded by the sum of (2.9) and (2.10).

Expression (2.7) converges to zero by the DCT. The point-wise convergence follows by the assumed continuity and positivity of $f(y|x)$. An integrable upper bound is given by (2.4). Expression (2.8) converges to zero by (2.3). Expressions (2.9) and (2.10) converge to zero because $Y \times X \setminus B_{\delta_m}(A_0^m) \nearrow Y \times X$ and the integrands are integrable by (2.4) and by the assumed finiteness of the second moment of $y$. Thus, the first part of the proposition is proved.

The second part of the proposition [bounds for differentiable $f(y|x)$] follows from the first part since

\[
\log\frac{f(y|x)}{\inf_{z \in C_r(y)} f(z|x)} \le \sup_{z \in C_r(y)} \left\| \frac{d\log f(z|x)}{dz} \right\| d^{1/2}r,
\]

which is implied by the multivariate mean value theorem: for any $(z_1,z_2)$,

\[
|\log f(z_1|x) - \log f(z_2|x)| \le \left\| \frac{d\log f(z|x)}{dz}\Big|_{z = cz_1 + (1-c)z_2} \right\| \|z_1 - z_2\|
\]

for some $c \in [0,1]$. Convergence of the bounds to zero is obtained in the same way as in the first part of the proposition.

To obtain the third part let us suppose that the fine part of the partition $\{A_j^m, 1 \le j \le m\}$ is centered at 0. If $(y,x) \in B_{\delta_m}(A_0^m)$, then $|y_i| \ge h_m m^{1/d}/2 - \delta_m \ge h_m m^{1/d}/3$ for $i \in \{1,\ldots,d\}$ and all sufficiently large $m$ and

\begin{align}
\int_{B_{\delta_m}(A_0^m)} y_i^2\, F(dy,dx)
&\le \int_{\{(y,x): |y_i| \ge h_m m^{1/d}/3, \forall i\}} y_i^2\, F(dy,dx) \notag\\
&\le (h_m m^{1/d}/3)^{-(q-2)} \int_{Y \times X} |y_i|^q\, F(dy,dx). \tag{A.7}
\end{align}

Similarly,

\begin{align}
\int_{B_{\delta_m}(A_0^m)} \sup_{z \in C_r(y)} \left\| \frac{d\log f(z|x)}{dz} \right\| F(dy,dx)
&\le \int_{\{(y,x): |y_i| \ge h_m m^{1/d}/3, \forall i\}} \sup_{z \in C_r(y)} \left\| \frac{d\log f(z|x)}{dz} \right\| F(dy,dx) \notag\\
&\le (h_m m^{1/d}/3)^{-(q-2)} \int_{Y \times X} |y_i|^{q-2} \sup_{z \in C_r(y)} \left\| \frac{d\log f(z|x)}{dz} \right\| F(dy,dx). \tag{A.8}
\end{align}

Since integrals in (A.7) and (A.8) are finite by assumption, (2.14) and (2.15) can be bounded above by an expression proportional to $(h_m m^{1/d})^{-(q-2)}$. Thus, the sum of (2.12)-(2.15) is bounded by

\[
c_1 \cdot \delta_m + c_2 \cdot \exp\{-(\delta_m/\sigma_m)^2/8\} + c_3 \cdot \delta_m^{d-1}h_m/\sigma_m^d + c_4 \cdot 1/(h_m m^{1/d})^{q-2}, \tag{A.9}
\]

where constants $c_1$, $c_2$, $c_3$ and $c_4$ do not depend on $m$. Let $b_m$ be the smallest number satisfying $b_m \ge \delta_m$, $b_m \ge \delta_m^{d-1}h_m/\sigma_m^d$, $b_m \ge 1/(h_m m^{1/d})^{q-2}$ and $b_m \ge \exp\{-(\delta_m/\sigma_m)^2/8\}$. The first three of these inequalities imply

\[
b_m \ge \{[(\delta_m/\sigma_m)^d]/m^{1/d}\}^{1/[2+1/(q-2)]}.
\]

It implies that for all sequences $\delta_m$, $\sigma_m$ and $h_m$ allowed by the corollary,

\[
b_m \ge \left(\frac{1}{m}\right)^{1/(d \cdot [2+1/(q-2)])}.
\]

One can verify that

\[
b_m \le \left(\frac{(4\log m/d)^{d/2}}{m^{1/d}}\right)^{1/[2+1/(q-2)]} \le \left(\frac{1}{m}\right)^{1/(d \cdot [2+1/(q-2)+\varepsilon])} \tag{A.10}
\]

when $\delta_m$ is equal to the first bound in (A.10), $(\delta_m/\sigma_m)^2 = 4\log m/d$ and $h_m = \delta_m^2/(\delta_m/\sigma_m)^d$. □



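Proof of Proposition 4.1. Define $I_1^m(x,s_m) = \{i : \|x_i^m - x\|^2 \le s_m\}$ and $I_2^m(x,s_m) = \{i : \|x_i^m - x\|^2 \ge 2s_m\}$. For $i \in I_1^m(x,s_m)$,

\[
[-R_m(x_i^{m\prime}x_i^m - 2x_i^{m\prime}x)] \ge [-R_m(s_m - x'x)] \tag{A.11}
\]

and for $i \in I_2^m(x,s_m)$,

\[
[-R_m(x_i^{m\prime}x_i^m - 2x_i^{m\prime}x)] \le [-R_m(2s_m - x'x)]. \tag{A.12}
\]

Note that

\begin{align}
\frac{\sum_{i \in I_1^m(x,s_m)} \exp\{-R_m(x_i^{m\prime}x_i^m - 2x_i^{m\prime}x)\}}{\sum_{i} \exp\{-R_m(x_i^{m\prime}x_i^m - 2x_i^{m\prime}x)\}}
&\ge 1 - \frac{\sum_{i \in I_2^m(x,s_m)} \exp\{-R_m(x_i^{m\prime}x_i^m - 2x_i^{m\prime}x)\}}{\sum_{i \in I_1^m(x,s_m)} \exp\{-R_m(x_i^{m\prime}x_i^m - 2x_i^{m\prime}x)\}} \notag\\
&\ge 1 - \frac{\mathrm{card}(I_2^m(x,s_m))}{\mathrm{card}(I_1^m(x,s_m))}\exp\{-R_m s_m\}
\ge 1 - d_x^{d_x/2}\,\frac{\exp\{-R_m s_m\}}{s_m^{d_x/2}}, \tag{A.13}
\end{align}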


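where the second inequality follows from (A.11) and (A.12). The last inequality follows from the following bounds on the number of elements in $I_1^m(x,s_m)$ and $I_2^m(x,s_m)$: $\mathrm{card}(I_1^m(x,s_m)) \ge 1$ [$s_m$ is chosen in (4.3) so that any ball in $X$ with radius $s_m^{1/2}$ has to contain at least one $x_i^m$] and $\mathrm{card}(I_2^m(x,s_m)) \le N(m) = d_x^{d_x/2}s_m^{-d_x/2}$.

For $i \in I_1^m(x,s_m)$ and $A_j^m \subset C_{\delta_m}(y)$,

\[
F(A_j^m|x_i^m) \ge \lambda(A_j^m) \inf_{z \in C_{\delta_m}(y),\, \|t-x\|^2 \le s_m} f(z|t). \tag{A.14}
\]

Inequalities (A.13), (A.14) and (A.2) imply that $p(y|x,\mathcal{M}_3)$ exceeds

\begin{align}
& \sum_{j: A_j^m \subset C_{\delta_m}(y)} \sum_{i \in I_1^m(x,s_m)} F(A_j^m|x_i^m)\,\phi(y,\mu_j^m,\sigma_m)\,\frac{\exp\{-R_m(x_i^{m\prime}x_i^m - 2x_i^{m\prime}x)\}}{\sum_{l} \exp\{-R_m(x_l^{m\prime}x_l^m - 2x_l^{m\prime}x)\}} \notag\\
&\quad \ge \inf_{z \in C_{\delta_m}(y),\, \|t-x\|^2 \le s_m} f(z|t) \cdot \left[1 - \frac{3d^{3/2}\delta_m^{d-1}h_m}{(2\pi)^{d/2}\sigma_m^d} - \frac{8d\sigma_m}{(2\pi)^{1/2}\delta_m}\exp\left\{-\frac{(\delta_m/\sigma_m)^2}{8}\right\}\right] \notag\\
&\qquad \cdot \left[1 - d_x^{d_x/2}\,\frac{\exp\{-R_m s_m\}}{s_m^{d_x/2}}\right]. \notag
\end{align}

The expression on the last line of this inequality converges to 1 by (4.3). The rest of the proof is exactly the same as the proof of Proposition 2.1. □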

Proof of Corollary 4.1. The proof of part (i) is identical to the proof of Corollary 2.1 part (ii).

The proof of part (ii) is also similar to the proof of Corollary 2.1 part (iii). Just set $s_m^{1/2} = \delta_m$ and note that (4.9) can be made arbitrarily smaller than the other parts of the bound by an appropriate choice of $R_m$. Thus, the bound is the same as in (2.18); we just need to express $m$ in terms of the number of mixture components in $\mathcal{M}_3$, $mN(m)$. From the definition of $N(m)$ and $s_m$, $N(m) = \lambda(B_i^m)^{-1} = d_x^{d_x/2} s_m^{-d_x/2}$. Since we set $s_m^{1/2} = \delta_m$ and $\delta_m = m^{-1/(d[2+1/(q-2)])}$ in the proof of Corollary 2.1,

$mN(m) = d_x^{d_x/2}\, m^{1 + d_x/(d[2+1/(q-2)])}.$

From this equation, one can express $m$ as a function of $mN(m)$ and plug it in (2.18) to obtain (4.12). □
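The inversion mentioned in the last step can be written out explicitly. The following is a sketch, assuming the exponent of $\delta_m$ reads $-1/(d[2+1/(q-2)])$; the symbols $K$ (total number of mixture components) and $\kappa$ are shorthand introduced here for illustration and are not part of the original notation.

```latex
\[
K = mN(m) = d_x^{d_x/2}\, m^{1 + d_x/\kappa},
\qquad \kappa = d\,[2 + 1/(q-2)],
\]
\[
\Longrightarrow\quad
m = \bigl(d_x^{-d_x/2} K\bigr)^{\kappa/(\kappa + d_x)} .
\]
```

Since $\kappa/(\kappa + d_x) < 1$, the implied $m$ grows more slowly than $K$, which is why a bound stated in terms of the number of components is weaker than (2.18) stated in terms of $m$.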

Proof of Proposition 5.1. First, consider point-wise convergence a.s. $F$. For fixed $(y,x)$ and an interval $C_{\delta_m(x)}(y)$ with center $y$ and length $\delta_m(x) > 0$,



p{y\x,M i ) = Y J F{AJ{x)\x) ( i ) {y, l JLj{xU rn {x)) 
+ F(AZ(x)\x)<f>(y,0,Mx)) 



> 



inf f(z\x)J2KA?(x)nC Sm{x) (y)) 



(A.15) 



X <Ky,lif(x),(T m (x)) 



> inf /(z|x) 1 



.i 

6h m (x) 



*eC 4m(B) (i/) V (2vr) 1 /V m (x) 



16o- m (x) f (<5 m /<7r, 

■ exp< 



(2^)1/2^^) 



where the last inequality follows from Lemma A.3 [if $\delta_m(x) \to 0$ and $mp_m \to 1$ then for any $(y,x)$ there exists $M$ such that $\forall m \ge M$, $C_{\delta_m(x)}(y) \cap A_0^m(x) = \emptyset$ and the lemma applies]. Convergence of the bound in (A.15) to $f(y|x)$ a.s. $F$ is implied by a.s. positivity and continuity in $y$ of $f(y|x)$ and conditions in (5.2). The rest of the argument establishing point-wise convergence is the
same as for M.q [details are below (2.3)]. 

Next, let us derive an integrable upper bound for the DCT, 

m 

p(y\x, Ma) = F{Af{x) \x)<t>(y, fif(x),a m (x)) 

+ F(AZ(x)\x)4>(yAMx)) 
> [1 - 



(A.16) 



x inf f(z\x) 

zeCi(r(x),y,x) 



£ KAf(x)) 

j:Af(x)cCi(r(x),y,x) 



+ lA™(x)(y)- inf J(z\x) ■ X(C (r(x),y,x)) 



a 

x 0(y,O,o- o (aO)- 
Lemma A.3 and condition (5.4) imply that the sum in (A.16) is bounded below by $1/2 - 1/4 = 1/4$ for all sufficiently large $m$. Equation (5.5) implies

, L f(y\x) 

log max < 1, 



' p(y\x,M 



i 



<logmax<l, ■('(»)/»)-' 



in{ zeC (r(x),y,x) f(z\x) ■ (j)(y, 0, <T (x)) 

(A.17) < log — — 1 max! 4>(y, 0, <r (x))(r(x)/2), 

</>{y,0,(To(x))(r(x)/2) { 

f(y\x) 

< -log[0(y,O,a o (x))r(x)/2] + lo. 



f(y\x) 



' m ^zeC(r{x),y,x) f{A x ) 



Inequality (A.17) follows by (5.5). The first expression in (A.17) is integrable 
by Assumption 5.1, part 6. The second expression in (A.17) is integrable by 
Assumption 5.1, part 3. This completes the proof of the proposition. □ 

Proof of Corollary 5.1. It suffices to show that Assumption 5.1 is satisfied. First, let us obtain a suitable $h_m$. Note that

(A.18)  $p_m \ge \int_{A_j^m(x) \cap [a_1(x), b_1(x)]} f(y|x)\,dy \ge \lambda(A_j^m(x) \cap [a_1(x), b_1(x)])\,\underline{f}.$

Also,

(A.19)
$p_m \ge \int_{A_j^m(x) \cap [a(x), a_1(x)]} f(y|x)\,dy$
$\quad \ge \int_{A_j^m(x) \cap [a(x), a_1(x)]} \underline{f}\,[y - a(x)]^n\,dy$
$\quad \ge (n+1)^{-1} \lambda(A_j^m(x) \cap [a(x), a_1(x)])^{n+1}\,\underline{f}$

and similarly $p_m \ge (n+1)^{-1} \lambda(A_j^m(x) \cap [b_1(x), b(x)])^{n+1}\,\underline{f}$. Combining this inequality with (A.18) and (A.19) we get for all $x$ and $j$,

< j = h m . 
For $\sigma_m(x) = p_m^{1/(4(n+1))}$ and $\delta_m(x) = p_m^{1/(8(n+1))}$ conditions (5.1), (5.2) and (5.4) hold.

Next, let $C(r,y,x) = [y, y+r]$ if $y \in (a(x), a_1(x) + r/2)$, $C(r,y,x) = [y - r/2, y + r/2]$ if $y \in [a_1(x) + r/2, b_1(x) - r/2]$ and $C(r,y,x) = [y - r/2, y]$ if $y \in (b_1(x) - r/2, b(x))$. By condition 4 of the corollary $\inf_{z \in C(r(x),y,x)} f(z|x) = f(y|x)$ for $y \notin [a_1(x) + r/2, b_1(x) - r/2]$. For $y \in [a_1(x) + r/2, b_1(x) - r/2]$, $\inf_{z \in C(r(x),y,x)} f(z|x) \ge \underline{f}$ and

$\int \log \dfrac{f(y|x)}{\inf_{z \in C(r(x),y,x)} f(z|x)}\, F(dy, dx) \le \log(\overline{f}/\underline{f}) < \infty.$

Condition 2 and (5.5) in Assumption 5.1 are assumed in the corollary. Since 
a(x) and b(x) are assumed to be square integrable, the second moment of y 
is finite, and condition 6 of Assumption 5.1 holds. □ 

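Lemma A.1. Define a hypercube $C_\delta(y) = \{\mu \in R^d : y_i \le \mu_i \le y_i + \delta,\ i = 1,\dots,d\}$. Let $A_1,\dots,A_m$ be adjacent hypercubes with centers $\mu_j$ and side length $h$ such that $C_\delta(y) \subset \bigcup_{j=1}^m A_j$ and $\delta > 3d^{1/2}h$. Define $J = \{j : A_j \subset C_\delta(y)\}$. Then

$\sum_{j \in J} \lambda(A_j)\phi(y;\mu_j,\sigma) \ge \int_{C_\delta(y)} \phi(\mu;y;\sigma)\,d\mu - \dfrac{3d^{3/2}\delta^{d-1}h}{(2\pi)^{d/2}\sigma^d}.$

By symmetry, this result holds for any hypercube with vertex at $y$ and side length $\delta$. This implies that for hypercube $D_\delta(y) = \{x : y_i - \delta/2 \le x_i \le y_i + \delta/2,\ i = 1,\dots,d\}$,

$\sum_{j : A_j \subset D_\delta(y)} \lambda(A_j)\phi(y;\mu_j,\sigma) \ge \int_{D_\delta(y)} \phi(\mu;y;\sigma)\,d\mu - 2^d\,\dfrac{3d^{3/2}(\delta/2)^{d-1}h}{(2\pi)^{d/2}\sigma^d}$

as long as $D_\delta(y) \subset \bigcup_{j=1}^m A_j$ and $\delta > 6d^{1/2}h$.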



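Proof. For $j \in J$ let $B_j = \{x : \mu_{ji} \le x_i \le \mu_{ji} + h,\ i = 1,\dots,d\}$ be a shifted and rotated version of $A_j$. Note that $\mu_j = \arg\max_{\mu \in B_j} \phi(\mu;y;\sigma)$, and therefore

$\sum_{j \in J} \lambda(B_j)\phi(y;\mu_j,\sigma) \ge \int_{\bigcup_{j \in J} B_j} \phi(\mu;y;\sigma)\,d\mu$
$\quad \ge \int_{C_\delta(y)} \phi(\mu;y;\sigma)\,d\mu - \int_{C_\delta(y) \setminus [\bigcup_j B_j]} \phi(\mu;y;\sigma)\,d\mu.$

Since $\{x : \min_j \mu_{ji} \le x_i \le \max_j \mu_{ji},\ i = 1,\dots,d\} \subset C_\delta(y) \cap [\bigcup_j B_j]$ and $\max_{j \in J} \mu_{ji} - \min_{j \in J} \mu_{ji} \ge \delta - 3d^{1/2}h$, we get $\lambda(C_\delta(y) \cap [\bigcup_j B_j]) \ge (\delta - 3d^{1/2}h)^d$ and

$\lambda\bigl(C_\delta(y) \setminus (C_\delta(y) \cap [\bigcup_j B_j])\bigr) \le \delta^d - (\delta - 3d^{1/2}h)^d \le 3d^{3/2}h\delta^{d-1},$

where the last inequality follows by induction. Thus,

$\int_{C_\delta(y) \setminus [\bigcup_j B_j]} \phi(\mu;y;\sigma)\,d\mu \le \dfrac{\lambda(C_\delta(y) \setminus [\bigcup_j B_j])}{(2\pi)^{d/2}\sigma^d} \le \dfrac{3d^{3/2}h\delta^{d-1}}{(2\pi)^{d/2}\sigma^d}.$ □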
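The first bound of Lemma A.1 can be checked numerically. The sketch below is an illustration only, assuming the slack term reads $3d^{3/2}\delta^{d-1}h/((2\pi)^{d/2}\sigma^d)$ and using a hypothetical grid of cubes $[kh,(k+1)h]$ in each coordinate; the function names are not from the paper.

```python
import itertools
import math

def phi1(u, sigma):
    """Univariate normal density with mean 0 and standard deviation sigma at u."""
    return math.exp(-0.5 * (u / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def lemma_a1_check(y, delta, sigma, h):
    """Return (lhs, rhs) of the Lemma A.1 bound for C_delta(y) with vertex y.

    lhs: sum over grid cubes A_j contained in C_delta(y) of lambda(A_j) * phi(y; mu_j, sigma),
         where mu_j is the center of A_j.
    rhs: integral of phi over C_delta(y) minus the assumed slack term.
    """
    d = len(y)
    cells = []
    for yi in y:
        k_lo = math.ceil(yi / h)             # first cell starting at or after y_i
        k_hi = math.floor((yi + delta) / h)  # cells k_lo, ..., k_hi - 1 lie inside
        cells.append(range(k_lo, k_hi))
    lhs = 0.0
    for idx in itertools.product(*cells):
        dens = 1.0
        for k, yi in zip(idx, y):
            center = (k + 0.5) * h
            dens *= phi1(center - yi, sigma)
        lhs += h ** d * dens
    # Each coordinate contributes Phi(delta / sigma) - 1/2 to the integral.
    integral = (0.5 * math.erf(delta / (sigma * math.sqrt(2.0)))) ** d
    slack = 3.0 * d ** 1.5 * delta ** (d - 1) * h / ((2.0 * math.pi) ** (d / 2) * sigma ** d)
    return lhs, integral - slack

if __name__ == "__main__":
    print(lemma_a1_check((0.13, -0.4), 1.0, 0.5, 0.005))
    print(lemma_a1_check((0.0,), 2.0, 1.0, 0.01))
```

In these examples the Riemann sum sits well above the lower bound, which suggests the constant in the slack term is conservative; the proof only needs the slack to vanish as $h/\sigma^d \to 0$.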



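Lemma A.2. Let $C_\delta(y)$ be a $d$-dimensional hypercube with center $y$ and side length $\delta > 0$. Then

$\int_{C_\delta(y)} \phi(\mu;y;\sigma)\,d\mu \ge 1 - \dfrac{8d\sigma/\delta}{(2\pi)^{1/2}} \exp\left\{-\dfrac{(\delta/\sigma)^2}{8}\right\}.$

Note that this inequality immediately implies that for any sub-hypercube of $C_\delta(y)$, $\tilde{C}$, with vertex at $y$ and side length $\delta/2$, for example, $\tilde{C} = C_\delta(y) \cap [\mu \ge y]$,

$\int_{\tilde{C}} \phi(\mu;y;\sigma)\,d\mu = \dfrac{1}{2^d} \int_{C_\delta(y)} \phi(\mu;y;\sigma)\,d\mu \ge \dfrac{1}{2^d} - \dfrac{1}{2^d}\,\dfrac{8d\sigma/\delta}{(2\pi)^{1/2}} \exp\left\{-\dfrac{(\delta/\sigma)^2}{8}\right\}.$

Proof.

$\int_{C_\delta(y)} \phi(\mu;y;\sigma)\,d\mu = \int_{\bigcap_{i=1}^d [|\mu_i| \le \delta/2]} \phi(\mu;0;\sigma)\,d\mu$
$\quad = 1 - \int_{\bigcup_{i=1}^d [|\mu_i| > \delta/2]} \phi(\mu;0;\sigma)\,d\mu$
$\quad \ge 1 - \sum_{i=1}^d \int_{|\mu_i| > \delta/2} \phi(\mu_i;0;\sigma)\,d\mu_i$
$\quad = 1 - 2d \int_{\delta/2}^{\infty} \phi(\mu_1;0;\sigma)\,d\mu_1$
$\quad \ge 1 - \dfrac{2d}{(2\pi)^{1/2}\sigma} \int_{\delta/2}^{\infty} \exp\{-0.5(\delta/2)\mu_1/\sigma^2\}\,d\mu_1$
$\quad = 1 - \dfrac{8d(\sigma/\delta)}{(2\pi)^{1/2}} \exp\left\{-\dfrac{(\delta/\sigma)^2}{8}\right\}.$ □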
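Because the Gaussian mass of a centered hypercube factorizes across coordinates, the bound of Lemma A.2 can be compared with the exact value. The sketch below is an illustration with hypothetical function names; it uses the closed form $\mathrm{erf}(\delta/(2\sqrt{2}\sigma))^d$ for the exact mass.

```python
import math

def gaussian_mass_in_cube(delta, sigma, d):
    """Exact mass of N(y, sigma^2 I_d) in the cube with center y and side length delta."""
    one_dim = math.erf(delta / (2.0 * math.sqrt(2.0) * sigma))
    return one_dim ** d

def lemma_a2_lower_bound(delta, sigma, d):
    """Lower bound 1 - 8 d (sigma/delta) / sqrt(2 pi) * exp{-(delta/sigma)^2 / 8}."""
    ratio = delta / sigma
    return 1.0 - 8.0 * d / (ratio * math.sqrt(2.0 * math.pi)) * math.exp(-ratio ** 2 / 8.0)

if __name__ == "__main__":
    for d in (1, 2, 5):
        for ratio in (0.5, 2.0, 6.0, 12.0):
            exact = gaussian_mass_in_cube(ratio, 1.0, d)
            bound = lemma_a2_lower_bound(ratio, 1.0, d)
            print(d, ratio, exact, bound)
```

For small $\delta/\sigma$ the bound is negative and therefore uninformative; it becomes tight only when $\delta/\sigma$ is large, which matches the requirement that $\sigma_m/\delta_m \to 0$ in the propositions.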

Lemma A.3. Let $A_1,\dots,A_m$ be a partition of an interval on $R$ such that $\lambda(A_j) \le h$ and $\mu_j \in A_j$. Assume $C_\delta(y) = [y - \delta, y + \delta] \subset \bigcup_j A_j$ is an interval with center $y$ and length $\delta$. Then

If $C_\delta(y) = [y - \delta, y]$ or $C_\delta(y) = [y, y + \delta]$ the lower bound in the above expression should be divided by 2.

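Proof. Let $J = \{j : A_j \cap C_\delta(y) \subset [y - \delta, y]\}$. For any $j \in J$ and $\mu \in A_j \cap C_\delta(y)$, $\mu - h \le \mu_j$ as $\lambda(A_j) \le h$ and $\mu_j \in A_j$, which implies $\phi(y,\mu_j,\sigma) \ge \phi(y,\mu - h,\sigma)$. Therefore,

(A.20)  $\sum_{j \in J} \lambda(A_j \cap C_\delta(y))\phi(y,\mu_j,\sigma) \ge \int_{\bigcup_{j \in J}[A_j \cap C_\delta(y)]} \phi(y,\mu - h,\sigma)\,d\mu.$

Note next that

$\int_{\bigcup_{j \in J}[A_j \cap C_\delta(y)]} \phi(y,\mu - h,\sigma)\,d\mu$
$\quad \ge \int_{y-\delta}^{y-h} \phi(y,\mu - h,\sigma)\,d\mu = \int_{y-\delta-h}^{y-2h} \phi(y,\mu,\sigma)\,d\mu$
$\quad = \int_{y-\delta}^{y} \phi(y,\mu,\sigma)\,d\mu + \int_{y-\delta-h}^{y-\delta} \phi(y,\mu,\sigma)\,d\mu - \int_{y-2h}^{y} \phi(y,\mu,\sigma)\,d\mu$
$\quad \ge \int_{y-\delta}^{y} \phi(y,\mu,\sigma)\,d\mu - \dfrac{3h}{(2\pi)^{1/2}\sigma}.$

By symmetry the same results can be obtained for $J = \{j : A_j \cap C_\delta(y) \subset [y, y + \delta]\}$. Thus

$\sum_{j=1}^m \lambda(A_j \cap C_\delta(y))\phi(y,\mu_j,\sigma) \ge \int_{y-\delta}^{y+\delta} \phi(y,\mu,\sigma)\,d\mu - 2\,\dfrac{3h}{(2\pi)^{1/2}\sigma}.$

The claim of the lemma follows by Lemma A.2. □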
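The Riemann-sum step of this proof can also be checked numerically. The sketch below is an illustration under stated assumptions: the slack term is taken to be $2 \cdot 3h/((2\pi)^{1/2}\sigma)$, the partition is a hypothetical equally spaced one with cells $[kh,(k+1)h]$, and the point $\mu_j$ is placed at a fixed fraction `offset` of each cell. All function names are hypothetical.

```python
import math

def phi(y, mu, sigma):
    """Normal density with mean mu and standard deviation sigma evaluated at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def riemann_sum(y, delta, sigma, h, offset):
    """Sum over cells A_j of lambda(A_j ∩ C) * phi(y, mu_j, sigma), C = [y - delta, y + delta]."""
    lo, hi = y - delta, y + delta
    k = math.floor(lo / h)
    total = 0.0
    while k * h < hi:
        left, right = k * h, (k + 1) * h
        overlap = min(right, hi) - max(left, lo)
        if overlap > 0:
            mu_j = left + offset * h  # any point of the cell is allowed
            total += overlap * phi(y, mu_j, sigma)
        k += 1
    return total

def lemma_a3_lower(delta, sigma, h):
    """Integral of phi over [y - delta, y + delta] minus the assumed slack 6h / (sqrt(2 pi) sigma)."""
    integral = math.erf(delta / (sigma * math.sqrt(2.0)))
    return integral - 6.0 * h / (math.sqrt(2.0 * math.pi) * sigma)

if __name__ == "__main__":
    for offset in (0.0, 0.5, 1.0):
        print(offset, riemann_sum(0.3, 1.0, 0.5, 0.02, offset), lemma_a3_lower(1.0, 0.5, 0.02))
```

The sum stays above the lower bound for every placement of $\mu_j$ within its cell, which is the property the proof needs since the $\mu_j$ are only required to lie in $A_j$.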